[RFC] closed positions book. #2646
This is a very good idea that I have suggested many times before! But for me, pe->blockedcount() >=4 alone is not enough: most positions in the book (>80%) are not blocked. Can we add some French and King's Indian positions by hand and remove the clearly open positions? Edit: we can allow patches tested with this book and then check STC non-regression with the initial book. |
@MJZ1977 , thanks! Some related observations/notes:
|
Thanks for this exciting incentive! Both strategies should be valid; the specialized one would indeed require a non-regression step. Another point is that for open positions search is a nifty tool, so it's closed positions that need elements. |
Influence of the book on Elo difference, noob_3moves.epd vs closedpos.epd:
[three pairs of test results, closedpos vs noob_3moves]
I think this indicates that the book is pretty general purpose. I will now reschedule a few of the recent yellow LTCs that presumably target |
Can I ask authors of recent yellow LTC patches (e.g. @Vizvezdenec @xoto10 @locutus2 @MJZ1977 @Lolligerhans) that target closed positions to resubmit them at LTC with the new closedpos.epd book, putting closedbook in the info field as well? It looks like a few of them will need rebasing, so I can't easily reschedule them. I've rescheduled 2 that were based on current master: |
I will retest my pawn chain patches with the closed book. I had three similar versions which all passed STC and failed LTC yellow. |
@vondele I had no such patch. I kept track of yellows so I am pretty sure. :) |
Unrelated to the current topic, but the last regression was only ~11elo, but @vondele's LTC tests are showing 18/20 elos respectively for closed book/noob book. I know we use a different book for regression, but still a bit surprising. |
Very interesting results! Am I right in thinking this book is about the same size as noob_3moves? So we've used noob_3moves to play a lot of games, then sampled games we're interested in after 8 plies - is that 14 plies from startpos then? That might be a concern for long-term use as the standard book, but given that the performance tests give very similar results to noob_3moves, I'm happy to test it out for a couple of weeks. Definitely a plus point to just update the main book instead of having a choice and having to do non-regression tests against the main book; I just hadn't expected this to be an option. Interesting ... |
well side note that last RT has different master that was behind by 2 elo patches and one simplification. |
Yea well usually I wouldn't expect a 7-9 elo difference with just two elo gaining patches lol... |
@adentong RT's use 8_moves book, which has the lowest elo spread (around 10% less). This makes the +50 elo between versions more meaningful. On top of that are the 3 patches, an undefined small effect of book optimization, and double error-bars. |
I indeed wouldn't focus too much on the comparison to the RT: it is not exactly the same version of the code, and the 8moves_v3 book is known to yield less Elo difference. The draw rate is slightly different with the books as well: 8moves 0.74, noob_3moves 0.73, closedpos 0.70. There have been a number of tests overnight using the new book (on old yellow LTCs): So let's get the expectations right. The closedpos book is not a magic bullet, and it will remain a real challenge to get patches passed. |
Based on the data collected, my proposal is to switch the default book to closedpos.epd relatively soon, used for essentially all tests (but not RT), and just continue testing as before. In particular, after passed STC and LTC tests on closedpos, PRs can be made, no need for additional non-regression tests. After a couple of weeks (June?) this strategy is reassessed. Give thumbs up or down if you agree or disagree with this proposal. |
@vondele But the best approach, it seems to me, is a mixed book: 50% positions from the closed book and 50% positions from the noob book. That way we would have the best of both worlds: closed position testing but no overfitting to this type of position, IMO. |
@locutus2 I plan to do the monitoring based on the usual 8moves RT runs. My argument against doing additional non-regression tests is that I want to keep our procedure as simple as possible. I'm also pretty confident that regressions are unlikely. But if there is a strong feeling in favor of the additional testing on passed patches, I'm fine with it. So, let's see what the vibes are. I'm not in favor of mixing the books. Let's try to get a clean signal. Again, the book is not extreme, and there will be opinions going in either direction (e.g. @MJZ1977 would like to see it more closed, you prefer a little more open). |
@vondele |
@locutus2 long term I can indeed see the point, and we can reassess. Short term, let's figure out if the book actually matters much. I think this is an experiment to try and see if the perceived weakness in closed positions can actually be more easily fixed with a closed book (if one looks at the positions, it really is not that closed). We might find that this is not as important as we think. This is in part an old discussion, the many years of development with the 2moves book, which really was not very sophisticated, illustrated that the book might not be the key ingredient to progress. |
I think we can keep the 2 books for now and change the default once things are clear. It would be interesting to find a patch that shows a big gap between the 2 books: green on the "closed book" and red on the "noob book". Then we can conclude. |
Last night I was thinking this was a big development ... now seeing the results of the reruns, it seems it doesn't make much difference at all. Perhaps there is a subtle change that we will become aware of over time. At the moment (very early of course), it seems the lower draw rate is perhaps the main change (benefit?) of this. My main concern if we switch to using this book for the medium term remains the beginning of the game. If we want sf to get better at the early moves, surely we need a test book that includes small ply openings (say 0-5) as well as longer ones? |
The way I understand it, we get positions from games in which Stockfish closes the position (please correct me if I misunderstood something). But what about games in which Stockfish fails to close the position? For example, when searching from root, we very commonly see the Exchange French, etc. Something feels off about it. |
I believe that the beginning of the game is too vague to be helped by eval, due to the very high availability of viable options and different setups. But as the midgame eval becomes more accurate, it will show in the openings via better steering of search. This book should not be regarded as a specialized closed position book, but as an attempt at a more balanced general book with regard to position type. The conditioning is soft and leads to open positions too. The problem with typical books is that they are balanced with regard to viable opening availability, which gives only a tiny signal from truly closed positions. SF has problems with those for 3 reasons:
Search inefficiency (and unfortunate setup selection) has partly to do with seeking generically favorable evals: a highly valued bonus in a static position acts like a black hole for the search. It sucks up all the resources in that direction, because it "believes" it's something supreme, blinding it to alternatives. An example is a very deep knight outpost on a totally blocked flank + space advantage. Totally useless at a glance for chess players, but SF aims for it even from the early game. Removing those black holes completely will require "alien" tech like pattern recognition, MCTS, NN, or a detailed categorization of cases. But an increased representation of black-hole situations will surely boost long-term health. I don't believe SF needs training on positions that are very easy for it, nor is it in danger of regressing. In tactical cases the various paths are narrow and concrete, and search shines. |
Good question. I guess there will be a few d4/e5 French advance structures in this book; perhaps this can be an iterative process and the book can be recreated occasionally? If we can improve sf's blocked position play a little, then it will choose more blocked positions ... then we can improve its play a little more ... etc. Edit: or we could just get some games from somewhere else, no reason to only use fishtest? e.g. http://data.lczero.org/files/match_pgns/1/ |
I believe there have been some valid concerns raised in this thread, enough so that we should consider alternatives. I have now built a new book with a very different approach based on these comments. I'll again do some testing on fishtest later. The major concerns I have seen raised are:
To address this, I made a book based on the frequency of FENs in games played at lichess (restricted to Elo > 1800, TC > 60). I retained the 200k most frequent FENs out of >8M games (see official-stockfish/books#9). This has the following advantages:
Of course, the choice of the initial database will somewhat influence the resulting FENs, but I think that's more or less secondary. Edit: the Elo testing yielded the following:
So the Elo spread is somewhat small on this book. Does anybody have a pointer to another PGN database of high-quality games (e.g. master level, ICCF)? It would need to be > 2M games to be suitable for building a book, I would say. Alternatively, a subset of high-quality Leela training games (again > 2M)? |
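The frequency-based selection described in this comment can be sketched roughly as follows. This is only an illustration, not the script used for official-stockfish/books#9: the function name and input format are hypothetical, the PGN parsing and the Elo/TC filtering are omitted, and counting each position once per game is an assumption.

```python
from collections import Counter

def most_frequent_fens(games, keep):
    """Return the `keep` most frequent positions across all games.

    `games` is an iterable of games, each given as an iterable of FEN
    strings (hypothetical input format; parsing PGN is left out).
    """
    counts = Counter()
    for fens in games:
        # Assumption: count a position once per game, so a repetition
        # inside a single game does not inflate its frequency.
        counts.update(set(fens))
    return [fen for fen, _ in counts.most_common(keep)]

# Toy data: letters stand in for FEN strings.
games = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "E", "E"],
]
book = most_frequent_fens(games, keep=2)
print(book)
```

With real data, `keep` would be 200000 and `games` would stream positions from the filtered lichess games.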
noob_2/3moves books were selected to avoid drawish openings IIRC, but the closedpos book just turned out to have a good Elo spread without any explicit drawish checks. (I wonder why?) Do you have any info on how many of these popularpos lines qualify as closed under the closedpos tests? Maybe we need a not-drawish test if we want to consider these popular and more open lines? |
No, they were not. In fact their draw ratio is rather high. Note: for the same Elo you want the highest possible draw ratio (= least amount of noise). If you want to lower the draw ratio, convert every draw into a win or loss using a coin. |
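The coin argument can be checked with a little arithmetic. The sketch below is only an illustration with made-up win/loss probabilities (the 0.73 draw rate matches the figure quoted earlier for noob_3moves, the split between wins and losses is an assumption). Replacing draws by coin flips keeps the expected score the same but raises the per-game variance by exactly d/4, so more games are needed for the same resolution.

```python
def score_variance(w, l, d):
    """Per-game variance of the score (win=1, draw=0.5, loss=0)
    for win/loss/draw probabilities w, l, d."""
    mean = w + 0.5 * d
    return w + 0.25 * d - mean ** 2

def coin_flip_draws(w, l, d):
    """Replace every draw by a fair coin flip between win and loss."""
    return w + 0.5 * d, l + 0.5 * d, 0.0

# Illustrative probabilities, not measured data.
w, l, d = 0.16, 0.11, 0.73
before = score_variance(w, l, d)
after = score_variance(*coin_flip_draws(w, l, d))
print(f"variance with draws:      {before:.6f}")
print(f"variance with coin flips: {after:.6f}")
print(f"increase (equals d/4):    {after - before:.6f}")
```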
I ran a second test on a book
the noob_3moves book was not selected specifically to avoid drawish openings, but it might be a side effect of how the database has been constructed. |
My books were built from one simple rule: pick moves that are top N and not worse than a score threshold. |
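As a rough sketch of that rule (not the actual book-building code): the function name, the score scale and the choice of an absolute threshold are all assumptions made for illustration.

```python
def pick_moves(scored_moves, top_n, min_score):
    """Keep the moves that are among the `top_n` best and whose score
    is not below `min_score`.

    `scored_moves` maps a move to its score from the side to move's
    point of view, higher being better (hypothetical format).
    """
    ranked = sorted(scored_moves.items(), key=lambda kv: kv[1], reverse=True)
    return [move for move, score in ranked[:top_n] if score >= min_score]

# Made-up scores in centipawns.
moves = {"e4": 30, "d4": 28, "Nf3": 20, "c4": 18, "g4": -90, "f3": -60}
picked = pick_moves(moves, top_n=5, min_score=-50)
print(picked)
```

Applying the function recursively to each resulting position for 2 or 3 moves would give a book in the spirit of the rule; the real N and threshold are not stated in the thread.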
So the average number of nodes needed to reach depth 13:
|
Weird, so the theory is right, but the result went the opposite... |
It makes sense now: Elo spread is related to the percentage of positions in the book that may be reached by playing SF topN moves. This is why closedpos had a good spread but popularpos didn't. |
I'm not sure. For example book 2moves_v1 contained basically random sequences of moves and had the same spread as noob_3moves. We measured it end of December and results were as below. Looks like books constructed differently and even with vastly different RMS bias may give the same sensitivity.
|
Well as for 2moves there are just 2 moves, so pretty much anything not losing a pawn's worth is within topN, and it did remove some outright bad moves. |
I added noob_2moves to the table above. Both 2moves books have very little in common it seems. Actually I want now to test hypothesis that positions with bigger depth 13 nodes are more complex. I'm going to sort 12k positions from 2moves_v2 by depth 13 nodes, split it in 3 equal parts and then use 1st and 3rd part as a new books to play 8000 games matches between SF11 and SF10. If it's true that bigger node count mean more complexity, then book made from 3rd part should give significantly bigger spread than the first one. It would be interesting to either confirm or debunk it. Unfortunately I have only a measly laptop, so it may take some time before I get back with the results. |
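The planned split can be sketched as below. The input format (pairs of position and depth 13 node count) and the function name are hypothetical, and the toy node counts are made up.

```python
def split_by_node_count(positions, parts=3):
    """Sort (fen, nodes) pairs by node count and split them into
    `parts` equally sized chunks. Leftover positions at the top end
    are dropped so that all chunks have the same size."""
    ranked = sorted(positions, key=lambda p: p[1])
    size = len(ranked) // parts
    return [ranked[i * size:(i + 1) * size] for i in range(parts)]

# Toy data: (position id, nodes needed to reach depth 13).
positions = [("p1", 900), ("p2", 150), ("p3", 4000),
             ("p4", 300), ("p5", 1200), ("p6", 700)]
low, mid, high = split_by_node_count(positions)
low_book = [fen for fen, _ in low]    # 1st part: fewest nodes
high_book = [fen for fen, _ in high]  # 3rd part: most nodes
print(low_book, high_book)
```

The two resulting lists would then be written out as separate books and used for the SF11 vs SF10 matches.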
The difference between my 2moves and 3moves books is just one more move that is not too bad, and my scores are back-propagated. Still, I think the coverage ratio among topN matters; the spread of 2moves_v1 might be because a higher RMS matters only for the first few moves but not beyond. |
I have #W #L #D (White POV) for the noob_3moves positions from fishtest LTCs. It typically looks like:
"rn1qkbnr/ppp2ppp/3p4/4pb2/2PP1P2/8/PP2P1PP/RNBQKBNR w KQkq -": [59, 48, 215],
"rnbqkb1r/pp1pppp1/2p2n1p/8/3P1P2/8/PPPBP1PP/RN1QKBNR w KQkq -": [38, 27, 186],
"rnbqkbnr/2pp1ppp/1p6/p3p3/8/3P4/PPPNPPPP/1RBQKBNR w Kkq -": [25, 44, 233],
"rn1qkb1r/pbpppppp/5n2/1p6/8/PP4P1/2PPPP1P/RNBQKBNR w KQkq -": [39, 35, 226],
So, openings appear winnable from both sides. I don't directly see a pattern. @vdbergh do you think that this data can be used to select good positions for a book? |
A lot of 150K-350K eval yellows recently. Maybe check them on closedpos? Also with too many tests + low success rate, eventually some will pass out of luck. With a closer examination of the best performers the harvesting will be safer. Atm it seems to me that too many resources are used on an extreme amount of different versions on very low pass rate, and thus a higher confidence would be logical. |
closedpos will not make them pass. The LTC bounds are very narrow, so patches that fall within this Elo range are expected to take a large number of games to resolve. This is the price to pay so that fewer patches pass by luck. A low success rate and too many similar tests cannot be solved by lowering the bar. As for me, I'm colorblind, so I cannot tell the difference between a yellow and a red SPRT test. |
@noobpwnftw I want less patches to pass by luck, not more. Atm the pass rates are extremely low, but the amount of tested patches is huge, so inevitably the quality decreases & resources are wasted. For colorblind purposes the yellow can be regarded as red without lowering the elo bar but with an even higher amount of games. A higher spread will enable better performance. closedpos had equal spread at STC but +2.7 at LTC, a very good indication. So it might not make them pass as you say, but it can make them fail faster! |
I hope so but with the large number of games their elo measurement is actually very accurate, they do fall around +0.5 range and they would still cost similar resources to conclude, and book probably won't change that. |
Well, at this point maybe even a +0.5 at worst is nice. Using millions of LTC games for little gain feels ineffective. What if without you? I also think that testing many versions of the same patch with slight changes is bad practice. One might get lucky in the end, worth 0.5, but at a very high price. Btw I like the system more than ever, but I think it's very beneficial to keep evolving it, not only SF. |
For that then I think it is important to understand how to manipulate elo spread. This is my scored list of all unique positions after 2 moves without any filtering: I think I have calculated scores for any position up to 4 moves but the data is quite large. |
@noobpwnftw could you make that score data available for 3moves? Either all of it if less than a few GB, or just for the positions in the noob_3moves book? It would be interesting to correlate with 'z=(w-l)/sqrt(w+l)'. |
So, I locally did a test, splitting the noob_3moves according to the abs( (w-l) / sqrt(w+l)) > 0.167 (roughly 1 sigma), and there is no measurable difference (60k games) between the low and high parts of the book. So I start suspecting the broad Gaussian is just the noise, and the feature near 0 is the signal.... this is using the results of 44M LTC fishtest games using the noob_3moves book. |
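A minimal sketch of such a split, using the four W/L/D samples quoted earlier in the thread. The function names are hypothetical, and the threshold is left as a parameter because the exact normalisation behind the 0.167 cut-off is not spelled out here; the value 1.2 below is only chosen to make the toy data fall on both sides.

```python
import math

def z_score(w, l):
    """z = (w - l) / sqrt(w + l); draws do not enter the statistic."""
    return (w - l) / math.sqrt(w + l) if w + l else 0.0

def split_book(stats, threshold):
    """Split positions into (low, high) by |z| relative to `threshold`.

    `stats` maps a FEN to [wins, losses, draws] from White's point of view.
    """
    low, high = [], []
    for fen, (w, l, _d) in stats.items():
        (high if abs(z_score(w, l)) > threshold else low).append(fen)
    return low, high

stats = {
    "rn1qkbnr/ppp2ppp/3p4/4pb2/2PP1P2/8/PP2P1PP/RNBQKBNR w KQkq -": [59, 48, 215],
    "rnbqkb1r/pp1pppp1/2p2n1p/8/3P1P2/8/PPPBP1PP/RN1QKBNR w KQkq -": [38, 27, 186],
    "rnbqkbnr/2pp1ppp/1p6/p3p3/8/3P4/PPPNPPPP/1RBQKBNR w Kkq -": [25, 44, 233],
    "rn1qkb1r/pbpppppp/5n2/1p6/8/PP4P1/2PPPP1P/RNBQKBNR w KQkq -": [39, 35, 226],
}
low, high = split_book(stats, threshold=1.2)  # illustrative threshold
print(len(low), "low-|z| positions,", len(high), "high-|z| positions")
```

Each half would then be saved as its own book and the two compared in a match, as described above.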
@vondele Full scores of positions after 3 moves: https://www.chessdb.cn/downloads/3moves_scores.zip |
The features around -15 and 0 are probably caused by the way I calculate things; the distribution might actually be smooth, but that doesn't matter when you sample moves with a wider range. |
No difference in my tests between book created from positions with low or high node count on depth 13 (TC 10+0.1). |
so, with #2670 we have a first patch that resulted from the closedpos book. Let's call this a success :-) I don't think we have particular evidence to change the default book, but I'm sure we now know that we still don't know quite a few things about opening books. I'll thus close this issue, keeping noob_3moves the default book. The other books can be used as non-default books, either for experimenting or to create Elo gainers, but we'll test patches for non-regression against noob_3moves to gather experience with this setup, asserting that we prefer generic solutions rather than specialized ones. |
See also: https://tests.stockfishchess.org/tests/view/5eb1e2dd2326444a3b6d33f9 #2662 :) |
OK, I overlooked that... should have been in the PR a little more clearly ;-). Extra credit for the book. |
I have made a pull request to the official book repo with a closed positions book.
official-stockfish/books#8
this still needs some testing, but should eventually be available.
I first want to do some testing comparing this to the noob_3moves book on fishtest before we possibly start using this, so that we have a feeling for its quality. My initial impression is rather good.
There are several options we can first discuss here before I decide on this.