Tune search constants #2260
Conversation
Bench 4024328
I said:
Alayant has quite rightly pointed out that I haven't run this test at stc so there is no indication of stc/ltc difference provided here. That's my prejudice showing through. I do think it is a problem, but this is not a good example. I will leave the original comment alone for now, but obviously I can edit that part out if required. |
@Vizvezdenec has suggested I run another LTC test at [0,3.5] since this has only passed 1 test and that was quite short. This seems reasonable to me, what do others think? |
Seems reasonable to me too. If the patch is as good as it looks, it should have no trouble clearing those bounds, which are lower than for the 1st test. |
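For reference, here is a minimal sketch of the simplified SPRT log-likelihood-ratio computation behind these bounds, assuming the plain logistic-elo model; fishtest at the time used a BayesElo-based model, so the LLR values it reports will not match this exactly. The W/D/L counts are the ones from this PR's first LTC run.

```cpp
#include <cmath>
#include <cstdio>

// Expected score for a given elo difference (logistic model).
double expected_score(double elo) {
    return 1.0 / (1.0 + std::pow(10.0, -elo / 400.0));
}

// LLR of H1 (elo = elo1) vs H0 (elo = elo0) given W/D/L counts.
double llr(double W, double D, double L, double elo0, double elo1) {
    double N = W + D + L;
    double w = W / N, d = D / N;
    double s   = w + d / 2.0;   // observed mean score per game
    double m2  = w + d / 4.0;   // second moment of the per-game score
    double var = m2 - s * s;    // per-game score variance
    double s0 = expected_score(elo0), s1 = expected_score(elo1);
    return (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * var / N);
}

int main() {
    // First LTC of this PR: W: 2280 L: 2074 D: 8600, bounds [0.5, 4.5].
    std::printf("LLR = %.2f\n", llr(2280, 8600, 2074, 0.5, 4.5));
    // The test accepts when LLR crosses +2.94 and rejects at -2.94.
    // Bounds of [0, 3.5] shift both hypotheses down, so a genuinely
    // good patch accumulates positive LLR faster.
}
```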
A few interesting points on this patch: I am really amazed this passed when the SingularExtension depth threshold was decreased from depth >= 8 to depth >= 6. |
@VoyagerOne I don't think that changing stats to basically any value does anything at depth 18+. |
@Vizvezdenec yeah I know... just thought it was funny because the number is negative. |
Congrats too! |
I suggest an STC. If it fails STC and passes another LTC... |
A few different ideas here. An stc just for scaling info can be run any time, that's cheap. |
Well, I suggest [0,3.5] because it already passed a stronger SPRT than [0,4]. |
I suggest: If it fails then we: |
I like @VoyagerOne's idea. I want to run stc [0,4] anyway, just to see how
it compares, so why not run it first to try to verify the patch. That's
quick and cheap.
If it fails, that's interesting ... run ltc again (not sure if 0,3.5 or
0,4) to see if original run was just lucky or if there is something
interesting going on with scaling.
|
Well, a [0,4] STC will sure hurt no one :) |
STC [0,4] started at http://tests.stockfishchess.org/tests/view/5d4071970ebc5925cf0f9041 Edit: |
The STC isn't doing so well. Let's see if a second LTC run will do better. |
I'm hoping @locutus2 will give the casting vote soon on whether to use [0,3.5] or [0,4.5] ... :) |
@xoto10 |
LTC [0,4] started at http://tests.stockfishchess.org/tests/view/5d407cff0ebc5925cf0f9119 Edit: Similar to/better than first test: |
The first passed LTC used more stringent bounds than normal for parameter patches, so using [0, 3.5] would make more sense imho. As Viz said: |
But whatever, the [0, 4] should pass anyway. In any case, the quick fail of the STC (-3.8 elo perf) bolsters @xoto10's concern about TC scaling. As I wrote when we talked in chat, we'd need a breakdown on how fishtest resources are currently used to help with a decision, but a second LTC green here would be a strong argument for increasing STC to 15 or 20 seconds. |
Yeah... if this patch passes LTC, which ATM seems very likely, we need to go back to the drawing board regarding the proper parameters for STC. One idea is to use this patch as a marker to find the proper STC, i.e. what's the shortest time control we can use while it still passes. My hunch is that reducing the depth for Singular Extension is why we see such scaling... |
Well, there are obvious problems with increasing the STC time control (although they can be partially offset by me halving my patch-writing frequency, as I did when we switched to the new SPRT bounds). |
This is quite an interesting result; the second SPRT is showing even better performance. |
Already investigating... |
Yes, I know, and this is for sure something to watch. |
I think we should do a VLTC if anything. But anyway, why do people here always want more tests for +5 elo patches, but are fine with +1 elo patches that need 200k games? Sounds kinda counter-intuitive to me - both have the same luck factor. |
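To put a number on that luck factor, here is a rough sketch of the 1-sigma elo resolution as a function of game count, assuming a 60% draw ratio and a score near 50% (both are assumptions, chosen as typical for fishtest LTC). It illustrates why resolving +1 elo takes on the order of 200k games while +5 elo resolves in a few thousand, at comparable confidence.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double drawRatio = 0.60;             // assumed draw rate
    const double w = (1.0 - drawRatio) / 2.0;  // win rate near s = 0.5
    // Per-game score variance: E[x^2] - E[x]^2 with scores {1, 1/2, 0}.
    const double var = (w + drawRatio * 0.25) - 0.25;
    // d(elo)/d(score) at s = 0.5 on the logistic curve: 1600 / ln(10).
    const double slope = 1600.0 / std::log(10.0);
    for (int games : {10000, 40000, 200000})
        std::printf("%6d games: +/- %.2f elo (1 sigma)\n",
                    games, slope * std::sqrt(var / games));
}
```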
I am not surprised at all that we had such a big difference between STC and LTC. Tuning was done at LTC, so the parameters are better suited to LTC than to STC. But to be sure we are not doing something wrong, it will be useful to do a classic [0, 3.5] LTC test. To go further: it would be interesting to understand which parameters should vary with search depth. |
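As one concrete instance of a depth-dependent parameter, here is a hedged sketch contrasting the two futility-margin shapes discussed further down in this thread: a margin that is a linear function of depth versus a per-depth array. Both the constant 175 and the array values are illustrative placeholders, not the tuned numbers from this PR.

```cpp
#include <cstdio>

using Value = int;
using Depth = int;   // depth in plies

// Linear form: one slope parameter, the shape of master's futility_margin().
// The constant 175 is illustrative, not the actual master value.
Value futility_margin_linear(Depth d) { return Value(175 * d); }

// Array form: one independently tunable margin per depth, as the tune used.
// These values are hypothetical placeholders.
constexpr Value FutilityMargin[7] = { 0, 180, 340, 520, 690, 890, 1090 };

int main() {
    for (Depth d = 1; d < 7; ++d)
        std::printf("depth %d: linear %4d, array %4d\n",
                    d, futility_margin_linear(d), FutilityMargin[d]);
}
```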
10k games at 180+1.8 seems like a very strong test to me, to see whether the changes truly scale or just happen to work well for 60+0.6. |
yes, 20+0.2 is reasonable, IMO. |
Nice suggestion @vondele! The 8-thread test using values from the tune after 89k games finished with +5 Elo here. I would say this outweighs the -1 Elo on single thread here, but obviously we would prefer no loss on 1 thread. The tune has now got to 141k games, and last time I checked (around 120k?) the bench had fallen to 25xxxxx. I will record the current values and could start another test, but I don't want to run too many as fishtest is busy with various useful-looking tests, and obviously we will run at least one more with the final values from the tune. Edit: One other factor: these tests were against the futility margin change by viz, but a different futility margin change has been merged into master. |
@xoto10, nice result, kind of confirms that a narrow single-threaded search scales better with lazy smp. Since potentially +5 Elo gain at 8 threads (potentially scaling well with thread count) is quite a thing, I would try to do a few more tests towards merging into master. One approach would be to rebase the tested branch on master, verify that the bench is still low, and do two tests:
if both pass, you open a new PR with the updated values. |
I must say that I am a bit reluctant (more than a bit reluctant, in fact) to commit the second patch with the second round of tuning (this set of values: xoto10/stockfish-xoto10@57f88bb...6fc904e) into master, because it seems like some of the values are just adding complexity over simple ideas, and introducing gratuitous random noise in the Stockfish algorithm is not good. I am thinking in particular of these points:
• using an array of futility values instead of a linear function
• using …
• using …
• using …
• using …
etc.
Each of these changes would be worth a normal SPRT with the normal double test for introducing complexity in the code base, and the patch feels like a combo of these complexity-adding ideas :-(
To end on a positive note, the experiments suggest that we indeed see a strange new effect of a difference between single-thread results and multiple-thread results, and I would propose to understand and then extract the core of the effect. My intuition is that the most important stuff lies in the pruning margins, as usual, so it would be interesting to take our time and re-test only with the updated margins and see if we can observe the same effects.
Note that I am completely open to fine-tune heavily strange margins or strange factors if they are already in the Stockfish code; it is the introduction of a large set of new strange factors into the code that I want to avoid.
Stéphane |
OK. Obviously I disagree. I understand the desire for simplicity, e.g. a mathematician likes a nice neat formula, but pi is not 3, it is 3.141592..., and to use 3 is just inaccurate. We don't know what the "best" values of these parameters are, so it seems to me we should just use the values that come out of the tuner: some will matter, and those that don't will be harmless. To force the use of "nice" 0 values (for example) seems like writing an eval function that only uses even values, or multiples of 4, instead of the full resolution available; there's a danger that each rounding loses a small amount of accuracy.
Changing X / C to X * A / B is just a practical matter to me: if C is small, changing it by integer amounts is too coarse a change to be realistic. If the extra operation is too expensive then the change won't work, so the test will fail anyway. If there's a gain, I don't think we should be worried by "extra complexity" like this.
Anyway, you're the boss, and maybe 80% of the benefit is in a few pure parameter changes, so let's investigate. I have started an 8-thread test with just pure parameter changes. The bench is 3079782, so higher than the 2747391 on the rebased patch, but perhaps this is still enough to show some threaded gains. If not, we will have to look closer at the changes missed out here. As a separate matter I will test the final output of the tune with the aim of getting a single-threaded elo gainer. |
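A toy example of the granularity argument: with integer arithmetic, a divisor C can only move in steps of 1, which changes the result coarsely, while a rational factor A / B can be tuned in much finer steps. All numbers below are illustrative.

```cpp
#include <cstdio>

int main() {
    const int x = 1000;
    // Divisor form: the single step from C = 3 to C = 4 moves the
    // result by 25%, far too coarse for a tuner to explore.
    std::printf("x / 3 = %d, x / 4 = %d\n", x / 3, x / 4);
    // Rational form: stepping A by 1 moves the result by ~0.3%.
    std::printf("x * 333 / 1000 = %d, x * 334 / 1000 = %d\n",
                x * 333 / 1000, x * 334 / 1000);
}
```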
@snicolet @xoto10 While I also appreciate simplicity, I think elo gain is elo gain no matter what. SF is playing chess way way beyond any human comprehension and so I think we should just let fishtest do its job, especially since the multithreaded test showed a very sizable elo gain. If anything, if the elo gain is actually as big as the first test suggested, then we can test this as not a param tweak, but with [0.5, 4.5]. |
I have to agree with @snicolet here. If a patch introduces more code complexity, even something as minor as extra multiplications, it shouldn't be considered a parameter tune. If you accept that premise, then the patch violates the standard rule that every new idea should be tested independently. In more practical terms, it is completely plausible that all the elo from this patch is due to two or three parameter tweaks, and that all the other changes suggested by SPSA are just noise. The lines pointed out by @snicolet might make no difference to sf's search at all, and will just add extra complexity and lose 0.1 elo due to the slowdown. I think @xoto10's test with only the pure parameter tweaks is the way to go. If it passes (or doesn't), the other changes can be tested separately. |
would it be considered a parameter tune if all multipliers were changed to floats? I think it's ridiculous. The slowdown is clearly not a problem. |
This is all a matter of opinion of course, but in my opinion, no. Floating point operations are significantly more expensive than their integer counterparts, so it just isn't the same code anymore. This again violates the rule that different code changes should be tested separately whenever possible. I see that @xoto10 is currently testing the patch both with only the simple parameter changes, and with only the added fractional parameters. These tests will be an important data point in the discussion. |
@MortenLohne Actually the results are completely meaningless if we test them separately since the values were tuned together. I imagine any potential elo gain would be brought about by the two halves working together; in fact I personally wouldn't bother testing them separately. As I said, sure some of the parameters may contribute to the majority of any potential elo gain, but we can't possibly figure out which ones do. The best thing to do is to trust the process and if there's added complexity then test it as such i.e. with [0.5, 4.5]. |
@adentong While that's sort of true in theory, in practice I think it's putting way too much faith in SPSA. It's not a mathematically sound optimization method to begin with (i.e. it does not converge to a maximum point), and expecting it to discover some particular combination of values that synergize perfectly after 89k games is just not realistic. Just look at all the SPSA tunes on fishtest that have appeared to converge really well, only to fail spectacularly to actually gain elo. Since this tune did show elo gain, at least for a particular time control, it must have discovered something. But 90% of the elo probably comes from 10% of the changes, and when implementing it all adds so much complexity, we should at least explore whether that complexity is actually necessary. |
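For readers unfamiliar with why SPSA tunes carry so much noise, here is a minimal sketch of one SPSA iteration in the spirit of fishtest's tuner: every parameter is perturbed simultaneously, and a single noisy match result drives the update of all of them at once. This is one plausible formulation under simplifying assumptions (fixed gains, no a_k/c_k decay schedules); fishtest's actual implementation differs in details, and the first parameter's starting values only loosely echo the "RM,600,0,1200,60,0.0020" line quoted later in the thread.

```cpp
#include <random>
#include <vector>

struct SpsaParam { double theta, c, r; };  // value, perturbation size, learning rate

// One SPSA iteration. playMatch(plus, minus) must return the score of
// the 'plus' parameter set against the 'minus' set, in (0, 1).
template <typename Match>
void spsa_step(std::vector<SpsaParam>& params, Match playMatch) {
    static std::mt19937 rng{12345};
    std::bernoulli_distribution coin(0.5);

    std::vector<int> delta(params.size());
    std::vector<double> plus(params.size()), minus(params.size());
    for (std::size_t i = 0; i < params.size(); ++i) {
        delta[i] = coin(rng) ? 1 : -1;                 // Rademacher direction
        plus[i]  = params[i].theta + params[i].c * delta[i];
        minus[i] = params[i].theta - params[i].c * delta[i];
    }
    // A single noisy pairwise result updates EVERY parameter at once,
    // which is exactly where the cross-parameter noise comes from.
    const double g = playMatch(plus, minus) - 0.5;
    for (std::size_t i = 0; i < params.size(); ++i)
        params[i].theta += params[i].r * params[i].c * g * delta[i];
}

int main() {
    std::vector<SpsaParam> params = {{600, 60, 0.002}, {175, 20, 0.002}};
    // Stand-in for a real engine match; a constant slightly above 0.5
    // just demonstrates the update direction.
    auto dummyMatch = [](const std::vector<double>&, const std::vector<double>&) {
        return 0.55;
    };
    for (int iter = 0; iter < 1000; ++iter)
        spsa_step(params, dummyMatch);
}
```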
I don't trust SPSA but I do trust SPRT. |
I think it's essential to test on 8-core LTC in order to verify that the elo gain does not derive mainly from the short 20+0.2 TC. Also, while the Vondele explanation seems solid, it could be the case that at short TC, where depth is important, width is covered by the many threads and heavier pruning is optimal, while at VVLTC, as the width of the search tree of each thread grows, the work of the threads overlaps more, losing effectiveness and risking all of them being blind to a rare pruned move.
The tests in general show an extreme TC sensitivity; hence between 2 patches that perform the same at LTC, the favorable one should be the one which performs WORSE at STC (given adequate confidence). But since LTC multithreaded tests are quite expensive, an interesting idea would be to test the candidate setups at VVSTC, VSTC and STC with the intention of identifying the scaling tendency, which is probably more crucial than the STC elo gain. The reason for this is that the TCs we mostly care about (i.e. TCEC TC) are many times longer than the STC, so a patch that starts at +5 elo with a negative scaling curve would at some point regress, while a patch that starts at -5 elo with a positive curve would bring elo.
In general I think that the testing methodology is:
a. very effective for neutral scaling, as we get combined STC+LTC confidence
A shy workaround for this is LTC-tuned stuff, privileged to skip the STC cutoff. Raising the STC is a supported idea that IMO would help, but at a high resource cost, additive to the cost of the new STC bounds. Another idea is to specifically try to locate scaling irregularities, since this is the problematic field, and this could be done cheaply with a VVSTC test combined with the STC one. So instead of 20+0.2, a combination of 10+0.1 and say 2+0.02 for every candidate patch seems to bring a lot of info at a very low cost (or why not even 8+0.08 and 2+0.02 for no cost at all). We would have:
I see no drawbacks at all for this method compared to the old one while keeping the same bounds, but most probably a reconfiguration of optimal bounds will bring further benefit. |
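A cheap way to act on that suggestion: measure elo at two or three very short TCs and fit a slope against log(time). The sketch below does an ordinary least-squares fit; the TC/elo pairs are invented purely for illustration.

```cpp
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // (base time in seconds, measured elo) for one hypothetical patch.
    const std::vector<std::pair<double, double>> results = {
        {2.0, -5.0}, {8.0, -1.5}, {20.0, 1.0}};

    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (auto [tc, elo] : results) {
        const double x = std::log(tc);
        sx += x; sy += elo; sxx += x * x; sxy += x * elo;
    }
    const double n = results.size();
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    // A positive slope hints the patch gains elo as the TC grows.
    std::printf("fitted elo per doubling of TC: %+.2f\n", slope * std::log(2.0));
}
```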
The funny thing is that a couple of months ago, after a long and passionate discussion, we "carefully" changed the bounds for the STC and LTC tests, choosing stricter bounds for STC and LTC so that we would have more confidence that the accepted patches would not be regressions. And now people do speculative LTC no matter what, completely ignoring the STC results and almost claiming that it is a better patch if it fails STC! A small contradiction, perhaps? I will tell you: the next step will be to do speculative VLTC at 32 threads for each and every patch, and the framework will die. Of course we can slow down SF development by a factor of 20 very easily just for the fun of it, but we should realize that what we have done in ten years would have taken two hundred years if we used 20 times more resources for every test... |
OK, there are multiple problems with the "whole" patch from this second tune: the FutilityMargin array I based it on was not merged but replaced with futility_margin(); I included some contempt changes which are badly formulated (they use Color when they should be testing the eval in previousScore); maintainers are not happy with the added multiplications; and maybe more. So I am trying to see what we can gain from this. Pure param changes: an obvious step is to just use the simple param changes in a normal [0,4] patch; unfortunately this failed non-regression on 1 thread badly:
Complementing the above, I am testing the added fractions from the tune. This is info only, since the contempt changes are badly formed and maintainers don't like adding operations in many places in one patch. |
I am also trying some individual changes from the tune, initially choosing the ones that change the bench the most, e.g. the change to reductions derived from statScore: searchconst3
Perhaps 1 or 2 of these will work as patches on their own. |
maybe a test of all tuned values at STC (the one that passed 8 th) would tell us something. maybe it's a complete fluke and 8 th vs 1 th has the discrepancy only because of time control. |
STC was made harder to pass than it was before, which actually increases the "need" for speculative LTC. It is very hard to get a green STC (there are something like 5x more patches ending yellow with over 70K games than getting a green... I didn't collect stats for a long period, but this is roughly correct); and because elo gainers were so rare, speculative LTCs have become a common thing. STC is supposedly there to avoid wasting resources on LTCs which are going to fail, but if there is no green STC to run at LTC, then it's not really a waste to run speculative LTCs. A good bunch of merged elo gainers this year followed speculative LTCs. One issue around fishtest is that the flexibility is very low. This issue about STC being harder to pass, and its consequences, has been visible for months, but it's almost impossible to get people to admit the need to rethink it and validate another set of rules which would eliminate the "speculative LTC" use. This search patch, which has shown an elo loss at fishtest STC and a massive gain at LTC, also raises the question of whether 10+0.1 is optimal when it comes to selecting patches to test at a higher TC. |
@snicolet For patches that score similarly at LTC, from the confidence perspective it's good to pass STC, but from the scaling perspective the harder it fails the better. It is a very simple concept: suppose we have 100% confidence, and one patch scores -5 elo at STC and +5 at LTC, while another scores +15 at STC and +5 at LTC. The first gains value with more time, the 2nd loses it. They are just completely different conditions. It's natural that optimizations for depth 12 searches will not be optimal for depth 30 ones. So the contradiction is indeed there and always has been, as we are interested both in resource economy, by cheaply filtering out stuff, and in good scaling. In the past the LTC was raised from 40" to 60" in order to be safer about decent scaling. (+resource increase, +scaling, =confidence) But as we saw, there is stuff that gives +10 elo at VLTC and scores negatively at STC! So by promoting only the stuff which passes the expensive and accurate (0.5,4.5) STC we definitely miss a lot of gems. The recent changes were focused on getting more STC confidence, while relaxing LTC requirements. Regarding scaling we are less safe than before, as we kept the same STC elo requirement while lowering the LTC one. (+resource usage, -scaling, +confidence) The initial plan anyway was to give them an evaluation period (with tuning bounds still pending), and I think it's time to reconsider and open a new discussion. I have the vision to combine our interests of resource economy and good scaling. |
I think I'm mostly done with this second tune now; there have been a few patches since and it seems out-of-date already. I think the knowledge that a bench well below 3000000 is possible and gives good results on multiple threads is interesting (although it is maybe too brittle?); perhaps this will come in useful at some point ...
|
The discussion seems to have dried up, so I am closing the PR. |
Do you happen to recall (or have written down) the SPSA parameters that you used for this tune? |
probably the default https://tests.stockfishchess.org/tests/view/5d3576b70ebc5925cf0e9e1e |
SPSA A=5000 RM,600,0,1200,60,0.0020 |
This is the result of a 200k tuning run at LTC:
http://tests.stockfishchess.org/tests/view/5d3576b70ebc5925cf0e9e1e
which passed quickly at LTC:
LLR: 2.95 (-2.94,2.94) [0.50,4.50]
Total: 12954 W: 2280 L: 2074 D: 8600
http://tests.stockfishchess.org/tests/view/5d3ff3f70ebc5925cf0f87a2
STC failed, but second LTC at [0,4] passed easily:
LLR: 2.96 (-2.94,2.94) [0.00,4.00]
Total: 8004 W: 1432 L: 1252 D: 5320
http://tests.stockfishchess.org/tests/view/5d407cff0ebc5925cf0f9119
Further work:
No doubt some of these changes produce most of the gain and some are neutral or even bad, so further testing on individual/groups of parameters changed here might show more gains. It does look like these tests might need to be at LTC though, so maybe not too practical to do.
This is a further indication that LTC is different from STC; we should debate whether we can increase STC slightly to make test results closer to LTC, say to 20+0.2 instead of 10+0.1.
Bench: 4024328