Misevaluated endgame patterns #2288
Comments
According to syzygy-tables.info, only 28.1% of KBPPvKRP positions are draws. What kind of rule(s?) would you use to tell if such a position is drawing? |
Looking at the overall win/draw rates for a material setup is not very informative, unless this setup is close to 100% won or drawn ; especially as syzygy bases contain a lot of weird positions which are extremely unlikely to happen in a game. The result of that kind of endgame depends on how advanced the pawns are, the files they are in, how pieces can defend each other... But while we can't catch all cases, there are some simple heuristics which can improve things significantly. |
Perhaps some tool could identify endgame positions (maybe even in a tablebase) where pawn breakthroughs are impossible and Stockfish yields different evaluations depending on the half-ply counter. Separately, perhaps in an endgame where pawn breakthroughs are impossible and the capture history indicates that captures are bad, something could be done to the endgame scaling factor (EDIT: or #2298 is an even better idea, and it even gains Elo). |
Well, at least some of these patterns are partially addressed by my latest patch. |
I'm not really sure how to test these kinds of patches, but it seems even in relatively simple rook endgames, there can be some improvements found: http://tests.stockfishchess.org/tests/view/5d8333840ebc5971531d3abb But alas, LTC failed: |
I hoped the LTC would pass, but yeah, it didn't... I made a simpler change for NNB vs R(Ps) endgames, which didn't make enough of a difference to pass at fishtest. These endgame knowledge improvements are hard to pass through fishtest, as you'd need a big chunk of them together to make enough of a difference. I also suspect that their effect on middlegame move selection would be more important at longer TC (ultra-bullet doesn't get enough depth for this), but this is too impractical to test. How much these additional conditionals hurt nps for those positions isn't all that clear to me either. So, how to proceed...? I don't know yet, but I'll keep adding patterns to the list as I see them. |
8/5p2/8/2b2p2/2N2k2/5p2/8/5K2 b - - 5 51 |
the drawn position arose from 8/5p2/8/5p2/Nb2pk2/5P2/5K2/8 b - - 0 48, and the winning move was only found when infinite analysis was turned on... otherwise it goes for what it thinks is a +4.1 but is actually just a drawn endgame |
@kelvinwop, I do not see what you see, that with even just one thread, Ba3 wins and is found in 3 seconds on my machine, one core with 64M hash. with EGTB it's found even faster also, you state the "...winning move was only found ..." and later say "...is actually just a drawn endgame .." that sounds contradictory to me.. and the position has 10 moves that win easily , did you post an incorrect FEN perhaps?
Also, your first position is not a draw either: "8/5p2/8/2b2p2/2N2k2/5p2/8/5K2 b - - 5 51"; 5 different moves win. 56 +100.00 3.72G 1:01.36 f6 Na5 Kg4 Nc4 f4 Nd2 Be3 Ne4 Bd4 Nd6 f2 Ne4 Kf5 Nd6+ Ke6 Nb5 Kd5 Nc7+ Ke4 Ke2 Bb6 Nb5 Bc5 Nc3+ Kd4 Nb5+ Kd5 Nc7+ Kd6 Nb5+ Kc6 Nc3 Bd4 Ne4 Kd5 Nd2 Ke5 Nc4+ Kf5 Nd6+ Kg6 Ne4 Bb6 Nd6 Bc5 Ne4 Bd4 Kf1 Bb6 Nd6 Be3 Nc4 Bc5 Nd2 Kf5 K |
right, I was recommended bishop A5 for some reason and then after following it for about 100 moves or so, the evaluation eventually dropped to 0 |
I'm following these lines you sent, but even with a search depth of 70 half-moves the evaluation is still at -3.4. |
This second line is also stuck at -3.4. Yeah, it seems impossible for black to win. His bishop can't force the white king off the promotion square, so he can only move back and forth forever. I guess it's possible to make a generalized checker for this pattern:
|
Another idea I had is generally when you're winning, the opponent doesn't really have counterplay and your advantage keeps increasing. Perhaps looking at d/dx advantage could be useful, as an example d/dx advantage in these drawn positions is always zero or slightly negative (I've evaluated the picture above to 76 moves, and it says move 78 is now at -3.3 instead of -3.4) so maybe that can be used to generalize the drawn endgame patterns. |
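The "d/dx advantage" idea above could be prototyped offline along these lines. This is only a sketch: the function names and the `min_progress` threshold are hypothetical, and the evals are assumed to be given in pawns from one fixed side's point of view, one value per move. As the replies in this thread show, a flat trend does not prove a draw (cyclic zugzwang wins also look flat for a long time), so this can at most flag candidates for closer inspection.

```python
def eval_trend(evals):
    """Least-squares slope of the evaluation (pawns per move)."""
    n = len(evals)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(evals) / n
    num = sum((i - mean_x) * (e - mean_y) for i, e in enumerate(evals))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_stalled(evals, min_progress=0.01):
    """Hypothetical heuristic: the better side's advantage is not growing.

    `min_progress` is an arbitrary illustrative threshold, not a tuned value.
    """
    if not evals:
        return False
    # Measure progress from the point of view of the side that is ahead.
    sign = 1 if evals[-1] >= 0 else -1
    return sign * eval_trend(evals) < min_progress


# The -3.4 ... -3.3 drift described above counts as "stalled".
print(looks_stalled([-3.4, -3.4, -3.3]))
# A steadily growing advantage does not.
print(looks_stalled([1.0, 1.5, 2.2, 3.0]))
```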
Your position is winning through zugzwang. The knight can't flee forever, and once he's down, the white king will be forced to leave the promotion square. This is a tablebase win, there is not any doubt about the result or the best moves. |
@kelvinwop Stockfish will eventually see the win in your examples, which is correct. Cyclic zugzwang positions are one of the most difficult concepts to understand in chess, as they are often mistaken for fortresses where no fortress exists. Through a series of repetitive-looking moves, the winning side can force the losing side to make a move that breaks the apparent fortress, after which the winning side wins easily. Here's another example of a cyclic zugzwang position, where the solution does not take as many moves to break the fortress as in your examples; it might be useful for you to study.
|
I’m not sure what your settings are, but Stockfish without EGTB and with default settings does find the winning moves given enough time. |
fyi, I wrote some end game positions generator some time ago and can generate 1000's of games given a certain endgame. Just let me know what you'd like. |
I can pick one and work on it. How about KNNKP? This seems like a draw unless the pawn can promote before a knight gets it. A strict eval on piece values probably doesn't do anything. |
I don't really think it's any use to improve KNNKP. |
TBs aren't used at fishtest, so though endgame knowledge is not very relevant in tournaments using TBs, it can influence testing games (plus some people don't use TBs, etc). The difficulty is that any single game pattern probably doesn't bring enough elo to do a clean pass at fishtest. Today, I taught Ethereal that a single minor piece has virtually no hope of winning against one or more enemy pawns. This covers KNKP, KBKP, KNKPP, KBKPP... where Ethereal frequently displayed eval between +1 and +2.5. Elo gain ? About +2 elo at STC and LTC. Most 5-6-7 men patterns that we can directly code rules for are not as frequent, or as clear-cut, or as egregious misevals. For KNNKP, the way SF overevaluates the drawn positions is poor (the current code is good to do the correct moves to win if it's winning, not to see from afar if the position is good or not), but a complete fix might gain 0.5 elo or so, something too small to be measured alone at fishtest. Even though good endgame knowledge helps to make better middlegame moves, fishtest games are often decided by the time a depth 15 search has many "hits" in 5 or 6-men positions. So, one would need to do reliable code for different endgames through individual specialized testing (there, your position generation @protonspring is very useful), then ensure that if the eval transition is not smooth, it's only because of an almost-certain win/draw (i.e., if the specialized code isn't able to clearly determine the status of a position, you don't want to have a brutal eval jump from previous positions in a search tree), then test a bundle in fishtest hoping for the small parts to all add together for a measurable gain. I have been wondering for a time too if some derivative of this approach could yield benefits for endgame knowledge:
|
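The Ethereal rule mentioned in the comment above (a single minor piece has virtually no hope of winning against one or more enemy pawns) can be illustrated with a small material check on a FEN. This is a sketch only: it is not Ethereal's or Stockfish's code, the helper names are made up, and capping the eval at 0 for the minor-piece side is an assumption about how such a rule might be applied.

```python
PIECES = "KQRBNP"


def material(fen):
    """Count pieces per side from the board field of a FEN string."""
    board = fen.split()[0]
    white = {p: board.count(p) for p in PIECES}
    black = {p: board.count(p.lower()) for p in PIECES}
    return white, black


def is_lone_minor(side):
    """King plus exactly one bishop or knight, nothing else."""
    return side["B"] + side["N"] == 1 and side["Q"] + side["R"] + side["P"] == 0


def has_only_pawns(side):
    """King plus at least one pawn, no other pieces."""
    return side["P"] >= 1 and side["Q"] + side["R"] + side["B"] + side["N"] == 0


def capped_eval(fen, eval_cp):
    """Cap a white-relative eval (centipawns) so that a lone minor piece
    facing only pawns is never reported as better (assumed policy)."""
    white, black = material(fen)
    if is_lone_minor(white) and has_only_pawns(black):
        return min(eval_cp, 0)
    if is_lone_minor(black) and has_only_pawns(white):
        return max(eval_cp, 0)
    return eval_cp


# KNKP: an optimistic +1.50 for the knight side is capped to 0.
print(capped_eval("8/8/8/4k3/4p3/8/3N4/4K3 w - - 0 1", 150))
```

A cap like this changes the eval abruptly at the material transition, which is exactly the smoothness concern raised in the comment, so it only makes sense where the result is almost certain.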
What if when running the fishtest, instead of starting from normal starting position, you start from positions you know can simplify into these "bad" endgames |
The problem with this is that these heuristics will still affect other games that cannot be simplified into such endgames. |
I wrote a different KNNKP ending and it beats master 3 to 1. So far, only 50k games, but it looks good. I will keep going and post my own PR after I get my "best" version. I will also include how others can test to verify my results. Either way, it looks like there is MUCH room for improvement in these endgames. EDIT: My numbers were wrong because I had resign on. |
fyi, #2386 |
Here is another improved KNNKP. #2553 |
Just so this doesn't get lost: some of these endgames also came up at https://groups.google.com/forum/#!topic/fishcooking/B9kp77iiGdE, with e.g. KRPvKBP and KRPPvKRP being misplayed often. I did some testing, and at short TC (1+0.01), starting from books with just these two endgames in a roughly 50/50 win/draw mix as given by TB, master is 50 Elo worse than master with TB. I did work a bit on KRPvKBP and it is possible to come up with a version that is 15-20 Elo better than master on these endgames:
but that's not enough to pass STC and LTC. Corresponding tests are here: There was also an interesting test (@joergoster) on the value of current endgame knowledge: In this context, there is data showing that full 6men syzygy on top of master is only about 20Elo at STC conditions: https://github.com/glinscott/fishtest/wiki/UsefulData#elo-gain-using-syzygy On the other hand, I feel that with the availability of all this data (TB, played games, ...), somehow it must be possible to extract some knowledge in a way that should benefit gameplay. |
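For readers comparing the numbers above, an Elo difference is commonly derived from a match score with the standard logistic formula. A minimal sketch, assuming a plain win/draw/loss tally; the function name is illustrative and this is not fishtest's own code (which also reports error bars and runs SPRT):

```python
import math


def elo_diff(wins, draws, losses):
    """Elo difference implied by a match score (logistic model)."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score <= 0 or score >= 1:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400 * math.log10(1 / score - 1)


# A 57.15% score corresponds to roughly +50 Elo.
print(round(elo_diff(5715, 0, 4285), 1))
```

Because endgame books like these produce many draws, the score stays close to 50% and a large number of games is needed before a 15-20 Elo difference stands out from noise.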
If you have a patch that doesn't regress against master, but is demonstrably better at any specific endgame, I think it can be committed. I would make a PR and let others verify.
|
tbh I think that this endgame stuff means less and less with bigger depth and especially when TBs are added. |
7-men is not accessible for the vast majority of users, but 7-8-9 men positions are extremely relevant for analysis as they often end up as leaf nodes of deep searches and guide critical choices in the middlegame. It is conceivable that, say, improving play in KRPPKRP can also be done with eval methods that would benefit rook endgames with even more pawns on board. For example, something evaluating how many tempi a rook needs to attack a pawn could be very useful. Many winning positions involve a rook being one tempo too late to stop promotion without being lost for the pawn, while in many drawn positions that are wrongly evaluated as very good, the weak side isn't too late. A specialized eval function for a single combination of pieces has the downside of being rather narrow. Generally speaking, the fewer pawns left on the board, the more we find erratic patterns that deviate a lot from regular eval. In theory, though, an upside of a specialized term is that when the main eval isn't tasked with not being too wrong in peculiar positions, it can reach a better optimum elsewhere, much as there is some Elo in tuning the related PSQTs after a new eval term is introduced. If we consider that SF is used without TB as a chess tool by many websites, specialized knowledge can also make it give better advice to players. Whenever possible, changes that can target a whole class of material combinations rather than a single one are more interesting. |
It's easier said than done. |
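The "how many tempi a rook needs to attack a pawn" idea can be sketched as a breadth-first search over rook moves. Everything here is an illustrative assumption rather than engine code: squares are `(file, rank)` tuples in the range 0-7, `occupied` is the set of all occupied squares (the target pawn included), the rook never captures, and the opponent is assumed not to move in the meantime.

```python
from collections import deque

DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def rook_moves(square, blockers):
    """Squares a rook on `square` can slide to, stopping before any blocker."""
    f, r = square
    for df, dr in DIRS:
        nf, nr = f + df, r + dr
        while 0 <= nf < 8 and 0 <= nr < 8 and (nf, nr) not in blockers:
            yield (nf, nr)
            nf, nr = nf + df, nr + dr


def rook_attacks(square, target, blockers):
    """True if a rook on `square` attacks `target` along an open line."""
    f, r = square
    for df, dr in DIRS:
        nf, nr = f + df, r + dr
        while 0 <= nf < 8 and 0 <= nr < 8:
            if (nf, nr) == target:
                return True
            if (nf, nr) in blockers:
                break
            nf, nr = nf + df, nr + dr
    return False


def rook_tempi_to_attack(rook, target, occupied):
    """Minimum number of rook moves before the rook attacks `target`.

    Returns 0 if it already attacks the target, None if it never can.
    """
    blockers = set(occupied) - {rook}
    seen = {rook}
    queue = deque([(rook, 0)])
    while queue:
        square, dist = queue.popleft()
        if rook_attacks(square, target, blockers):
            return dist
        for nxt in rook_moves(square, blockers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Rook on a1, pawn on d5, empty board otherwise: one move (e.g. Rd1).
print(rook_tempi_to_attack((0, 0), (3, 4), {(0, 0), (3, 4)}))
```

A real eval term would compare this count with the number of moves the pawn needs to promote, and would have to account for the side to move and for the kings; the search above only covers the rook's side of that race.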
Sure, some machine learning might really be useful, not necessarily NN. The patch I mentioned above for KRPvKBP was written using some simple ML. I'll try to do something similar for KRPPvKRP in the coming weeks. |
It looks like even the 6-man bases have a hard time to catch up with all endgame knowledge!
|
It would be great if we could test individual eg books on the framework. 😀
|
To test individual endgames, which I think is best done at home right now, we need some books and stats. https://www.dropbox.com/s/b2i63tzgwi1h39v/material_key_books.zip?dl=0 contains a collection of FENs, ordered by material key. These FENs have been extracted from a few million Stockfish LTC testing games, collecting positions with a given material key, 9 pieces or less, one position per key per game, and only for keys that are on the board for 6 plies or more. They could serve as testing books for certain material counts, and give an indication of the importance of certain combinations. There is a README.txt with full statistics (i.e. counts), but a summary is below:
|
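For anyone wanting to build similar books, grouping FENs by material key can be sketched as follows. The key format (`KBPPvKRP`, white side first) mirrors the names used in this thread, but the exact format used inside the linked archive is an assumption, as are the function names.

```python
from collections import defaultdict

ORDER = "KQRBNP"


def material_key(fen):
    """Build a key such as 'KBPPvKRP' from the board field of a FEN."""
    board = fen.split()[0]
    white = "".join(p * board.count(p) for p in ORDER)
    black = "".join(p * board.count(p.lower()) for p in ORDER)
    return f"{white}v{black}"


def piece_count(fen):
    """Total number of pieces on the board, kings included."""
    return sum(ch.isalpha() for ch in fen.split()[0])


def group_by_key(fens, max_pieces=9):
    """Map material key -> list of FENs with at most `max_pieces` pieces."""
    books = defaultdict(list)
    for fen in fens:
        if piece_count(fen) <= max_pieces:
            books[material_key(fen)].append(fen)
    return dict(books)


positions = [
    "8/8/5kBp/4rP1P/8/3K4/8/8 w - -",
    "k7/2BK4/3N4/1N6/8/8/7r/8 w - -",
]
for key, fens in group_by_key(positions).items():
    print(key, len(fens))
```

This sketch does not merge colour-mirrored keys (KRPvKBP and KBPvKRP stay separate) and does not apply the "on the board for 6 plies or more" filter, which needs the game's move sequence rather than single FENs.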
For the endgames above, I used these positions to have master play against master+tablebases (6men + relevant 7men), 1000 games at short TC (1.0+0.01). It does show that this set of positions is quite biased towards drawn positions, but more importantly, this gives an idea of which endgames might have the most potential for improvement:
|
I tried to quantify the effect of using selected 7men TB for playing games (tb7 = complete 6men +
I'm wondering if this can be reproduced (@joergoster ?), but my setup seems legit. Earlier results indicated that at STC the benefit of TB is larger than at LTC (6men, but in RAM, see https://github.com/glinscott/fishtest/wiki/UsefulData#elo-gain-using-syzygy). |
Hi, I'm curious if there's an update on this issue. Are there any plans, for example, to make endgame books available on fishtest or to have a separate method (testing against TBs?), or do we still rely on SPRT for now? |
current master finds mate 10 quite quickly, same as sf11 |
@kelvinwop Your screenshot crops the FEN string. I think you might have been hit by the 50-move rule, but you have to verify when the last capture or pawn push was. |
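The check suggested above can be done directly from a complete FEN: its fifth field is the halfmove clock, i.e. the number of plies since the last capture or pawn move, and a draw can be claimed once it reaches 100. A small sketch (function names are illustrative):

```python
def halfmove_clock(fen):
    """Return the halfmove clock from a FEN, or None if the field is missing."""
    fields = fen.split()
    if len(fields) < 5:
        return None
    return int(fields[4])


def plies_until_fifty_move_draw(fen):
    """Plies left before the 50-move rule applies, or None if unknown."""
    clock = halfmove_clock(fen)
    if clock is None:
        return None
    return max(0, 100 - clock)


# Position posted earlier in this thread: clock is 5, so 95 plies remain.
print(plies_until_fifty_move_draw("8/5p2/8/2b2p2/2N2k2/5p2/8/5K2 b - - 5 51"))
```

Several FENs in this thread are shortened to four fields, in which case the clock simply cannot be recovered from the string.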
@miguel-l You can test with the endgames.epd book. |
I'll close this, needs a fresh investigation after NNUE has been tuned. |
Endgame positions are used from far away during search and can have a big impact on finding the good moves in the middlegame.
So just as having TBs gives a small strength boost, additional endgame knowledge should result in a strength increase.
The way we currently have only 0.00 as a value for draws, rather than a range within which the eval sits for theoretical draws (which would still allow preferring the "good side" of a forced draw and playing more challenging moves even in a known forced draw), is a limitation, but it doesn't make additional draw/win pattern detection useless.
I think it is much more practical to maintain a comprehensive list here than over at fishcooking.
The bishop protected by a pawn and blocking an enemy pawn
For example, the following position is dead drawn but has static eval around -1.8, and searching to depth 50 or 60 with 6-men TB doesn't fix the blindness as it just shuffles (Leela correctly sees it as dead drawn) :
8/8/5kBp/4rP1P/8/3K4/8/8 w - -
This 6-men position is even worse when it comes to static blindness (the resulting KPK endgame is drawn) :
8/8/5kBp/2r4P/8/3K4/8/8 w - - 0 1
However, there are positions where the resulting KPK endgame is won for the side with the rook e.g. 8/8/8/6k1/6Bp/4K2P/7r/8 w - - 0 1 ; so this needs some care to detect properly.
Update: Viz's initiative patch makes things better here.
All pawns on the same side facing each other. The 2v1 and 3v2 pawns setup on the same side are usually very drawish.
In this dead drawn position : 6k1/8/6p1/2n2p2/7P/2B2PP1/5K2/8 b - - 1 69
Redfish, running on monster hardware and with 6-men TB, still evaluates the position over +1
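The "all pawns on the same side" condition is easy to test from a FEN. The sketch below is only an illustration of that one condition: it treats files a-d and e-h as the two flanks (an assumption) and says nothing about pawn counts or piece placement, which a real drawishness heuristic would also need.

```python
def pawn_files(fen):
    """Set of file indices (0 = a ... 7 = h) that contain a pawn of either colour."""
    files = set()
    for rank in fen.split()[0].split("/"):
        file_index = 0
        for ch in rank:
            if ch.isdigit():
                file_index += int(ch)  # run of empty squares
            else:
                if ch in "Pp":
                    files.add(file_index)
                file_index += 1
    return files


def all_pawns_on_one_flank(fen):
    """True if there are pawns and they all sit on files a-d or all on e-h."""
    files = pawn_files(fen)
    if not files:
        return False
    return max(files) <= 3 or min(files) >= 4


# The 3v2 kingside position quoted above.
print(all_pawns_on_one_flank("6k1/8/6p1/2n2p2/7P/2B2PP1/5K2/8 b - - 1 69"))
```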
There are many problematic patterns in these endgames.
Here is one that is significantly overevaluated, with one side having a "passed pawn" that can in practice never be pushed : 8/2k5/4R3/Pp6/1P6/6r1/1K6/8 b - - 0 1
Latest fish (without TBs) gives me +1.17 for white there at depth 65.
This pattern of a passed pawn supported by a blocked pawn and which can never be pushed occurs in a number of other misevaluated endgames.
KNN is drawn and KNNKP is often drawn too, so this makes trading off the rook for the bishop a very potent threat for the weak side. KBNNKR is drawn in 80% of the positions in syzygy tables. Sadly, I can't check all the cases that are winning, but the king being in the corner and vulnerable, or the rook being quickly capturable, seem to be the most relevant cases.
Example drawn position : k7/2BK4/3N4/1N6/8/8/7r/8 w - -
TB Draw, SF depth 63/104 says +3.24 still.
Fixed with #2553
There is no check whatsoever on the pawn/king combined positions, which means that without TB, SF frequently has extremely inflated evals.
Example position : 8/3k4/8/3N4/8/6K1/1p6/1N6 b - - 4 6
+7 for the side with the knights (!) with seldepth 100 (it stops at the 50mr barrier)
All pawns on the same side touching the edge, the edge pawn being ahead for the bishop side, and both pawns on the same color as the bishop. Draw exploiting a fortress.
Position : 8/6pk/4Kb1p/7P/6P1/2R5/8/8 w - - 0 1
SF evaluates this as +2.69 at depth 100 with 5-men TB on.
Note that a similar position having 3 pawns for each side is completely winning, so a fix for the previous fortress should not damage the evaluation of that sort of position in the process.
Position : 5k2/5p2/3Kb1p1/7p/7P/2R3P1/5P2/8 w - -
SF's static eval evaluates this as winning for white, and a deep search confirms this is right.
Often winning, but can sometimes be drawn with pawns on both flanks.
Position : 8/6k1/p3R3/P7/7P/5r2/2K5/8 w - -
Similar positions arise with 7 or 8 pieces on the board
A case of "being up a passed pawn is not enough". The weak side needs accurate play to hold.
Position : 8/8/5NK1/2r2nP1/1k6/8/4R3/8 w - -
- [ ] 9th example SCB endgames with no passer and weak pawns
8/4k3/4p3/3pPp2/1b1P1P2/4K1B1/8/8 w - -
Black is winning because there are two weak blocked pawns on the same color as the bishops. If there were only one, it would be a draw. SF's static eval absolutely sucks at telling them apart, and while it can find the correct continuations once the position is on the board, it won't be able to guide search from far away towards such a position.