
Tune search constants #2260

Closed · wants to merge 1 commit

Conversation

@xoto10 (Contributor) commented Jul 30, 2019

This is the result of a 200k tuning run at LTC:
http://tests.stockfishchess.org/tests/view/5d3576b70ebc5925cf0e9e1e

which passed quickly at LTC:
LLR: 2.95 (-2.94,2.94) [0.50,4.50]
Total: 12954 W: 2280 L: 2074 D: 8600
http://tests.stockfishchess.org/tests/view/5d3ff3f70ebc5925cf0f87a2

STC failed, but second LTC at [0,4] passed easily:
LLR: 2.96 (-2.94,2.94) [0.00,4.00]
Total: 8004 W: 1432 L: 1252 D: 5320
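For readers unfamiliar with these LLR lines: fishtest runs a sequential probability ratio test and stops once the log-likelihood ratio leaves the interval (-2.94, 2.94). The following sketch computes an LLR from W/L/D totals and Elo bounds under a simple normal approximation; this is a simplification (fishtest's real implementation works in BayesElo units with a more careful variance model), so the value it produces will not match the reported 2.96 exactly.

```python
import math

def elo_to_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    """Normal-approximation log-likelihood ratio for H1: elo1 vs H0: elo0."""
    n = wins + losses + draws
    w, d = wins / n, draws / n
    s = w + 0.5 * d                      # observed mean score per game
    var = w + 0.25 * d - s * s           # per-game score variance
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (s - 0.5 * (s0 + s1)) / var

# The second LTC above: 8004 games, W 1432 L 1252 D 5320, bounds [0, 4].
llr = sprt_llr(1432, 1252, 5320, 0.0, 4.0)
print(round(llr, 2))  # positive, i.e. past the acceptance bound under this simplified model
```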

Further work:
No doubt some of these changes produce most of the gain and some are neutral or even bad, so further testing on individual/groups of parameters changed here might show more gains. It does look like these tests might need to be at LTC though, so maybe not too practical to do.
This is a further indication that LTC is different from STC, we should debate whether we can increase STC slightly to make test results closer to LTC, say to 20+0.2 instead of 10+0.1.

@xoto10 (Author) commented Jul 30, 2019

I said :

This is a further indication that LTC is different from STC, we should debate whether we can increase STC slightly to make test results closer to LTC, say to 20+0.2 instead of 10+0.1.

Alayan has quite rightly pointed out that I haven't run this test at STC, so there is no indication of an STC/LTC difference here. That's my prejudice showing through. I do think it is a problem, but this is not a good example. I will leave the original comment alone for now, but obviously I can edit that part out if required.

@xoto10 (Author) commented Jul 30, 2019

@Vizvezdenec has suggested I run another LTC test at [0,3.5] since this has only passed 1 test and that was quite short. This seems reasonable to me, what do others think?

@Alayan-stk-2:

Seems reasonable to me too. If the patch is as good as it looks, it should have no trouble clearing those bounds, which are lower than for the 1st test.

@VoyagerOne (Contributor):

A few interesting points on this patch:
Our bonus stats will be -8 if depth >17 (that doesn't make much sense)

I am really amazed this passed when the SingularExtension was decreased from depth>= 8 to depth>= 6.

@Vizvezdenec (Contributor):

@VoyagerOne I don't think that changing stats to basically any value does anything at depth 18+.
At some point sg made a test, even depths 17-16-15 don't bring a lot of elo.
Singular extension is for sure interesting, it also touches LMR because of singular extension LMR.

@VoyagerOne (Contributor):

@Vizvezdenec yeah I know... just thought it was funny because the number is negative.

@locutus2 (Member):

Congrats!
I too think a second LTC should be done for verification (under normal conditions). The scaling question is also interesting, so I would run a standard STC for comparison as well. Or, if we want to invest the resources, run a VLTC instead of an LTC to get the scaling.

@VoyagerOne (Contributor):

I suggest an STC; we can use it as a data point.

If it fails STC and passes another LTC, we may have to go back and re-examine the bounds we use for STC.

@xoto10 (Author) commented Jul 30, 2019

A few different ideas here. An STC just for scaling info can be run any time; that's cheap.
@locutus2 you say "under normal conditions": are you thinking of running LTC at [0,4] instead of viz's suggested [0,3.5]?

@Vizvezdenec (Contributor) commented Jul 30, 2019

Well, I suggest [0;3.5] because it already passed a stronger SPRT than [0;4].
Two [0;4] runs are easier to pass than [0.5;4.5] plus [0;3.5], so forcing the patch to kill [0.5;4.5] AND [0;4] is kinda overkill.
Also it's nothing new for us to accept tests that pass LTC twice even if they are red at STC.
To be totally honest I consider STC test to be useless by this point because if it passes it's "okay, we passed LTC anyway", if it fails we will "retest at LTC, because it's an LTC tuning".
Maybe smth like 120+1.2 [0; 3.5] is the way to go but it's costly. I honestly prefer normal 60+0.6 since we don't have a single heuristic that does smth specific for depths that can be reached at 120+1.2 but can't be reached at 60+0.6, so there shouldn't (logically) be any scaling issues while a lot of stuff is different for STC and LTC in search.

@VoyagerOne (Contributor):

I suggest:
Run STC [0,4]. It's cheap, and personally I'm really curious how it fares.
If it passes, we can accept the patch.

If it fails then we:
Run another LTC at [0,4].
If it passes we can accept the patch.

@xoto10 (Author) commented Jul 30, 2019 via email

@Vizvezdenec (Contributor):

Well, a [0;4] STC will sure hurt no one :)

@xoto10 (Author) commented Jul 30, 2019

STC [0,4] started at http://tests.stockfishchess.org/tests/view/5d4071970ebc5925cf0f9041

Edit:
STC failed rapidly:
LLR: -2.96 (-2.94,2.94) [0.00,4.00]
Total: 10152 W: 2237 L: 2362 D: 5553

@adentong:

The STC isn't doing so well. Let's see if a second LTC run will do better.

@xoto10 (Author) commented Jul 30, 2019

I'm hoping @locutus2 will give the casting vote soon on whether to use [0,3.5] or [0,4.5] ... :)

@VoyagerOne (Contributor):

@xoto10
Please do [0,4]; that is what we use for parameter-tuning patches.

@xoto10 (Author) commented Jul 30, 2019

LTC [0,4] started at http://tests.stockfishchess.org/tests/view/5d407cff0ebc5925cf0f9119

Edit: Similar to/better than first test:
LLR: 2.96 (-2.94,2.94) [0.00,4.00]
Total: 8004 W: 1432 L: 1252 D: 5320

@Alayan-stk-2:

The first passed LTC used more stringent bounds than normal for parameter patches, so using [0, 3.5] would make more sense imho. As Viz said:

Two [0;4] runs are easier to pass than [0.5;4.5] plus [0;3.5], so forcing the patch to kill [0.5;4.5] AND [0;4] is kinda overkill.

But whatever, the [0, 4] should pass anyway.

In any case, the quick fail of the STC (-3.8 Elo perf) bolsters @xoto10's concern about TC scaling. As I wrote when we talked in chat, we'd need a breakdown of how fishtest resources are currently used to help with a decision, but a second LTC green here would be a strong argument for increasing STC to 15 or 20 seconds.

@VoyagerOne (Contributor) commented Jul 30, 2019

Yeah... if this patch passes LTC, which ATM seems very likely, we need to go back to the drawing board regarding the proper parameters for STC.

One idea is to use this patch as a marker to find the proper STC, i.e. the shortest time control we can use while it still passes.

My hunch is that reducing the depth for singular extension is why we see such scaling.

@Vizvezdenec (Contributor):

Well, there are obvious problems in increasing the STC TC (although they can be partially fixed by me halving my patch-writing frequency, as I did when we switched to the new SPRT bounds).
This is a thing to discuss and investigate of course...

@Vizvezdenec (Contributor):

This is quite an interesting result; the second SPRT is showing even better performance.
Honestly I think I've never seen such a big difference between SPRT results for STC and LTC. This needs further investigation :)

@VoyagerOne (Contributor):

Already investigating...
See http://tests.stockfishchess.org/tests/view/5d408ffb0ebc5925cf0f9193

Vizvezdenec referenced this pull request in xoto10/stockfish-xoto10 Jul 30, 2019
@Vizvezdenec (Contributor):

Already investigating...
See http://tests.stockfishchess.org/tests/view/5d408ffb0ebc5925cf0f9193

Yes, I know, and this is for sure smth to watch.

@Sopel97 (Member) commented Jul 30, 2019

I think we should do a VLTC if anything.
Current results suggest very good scaling with time, so it should pass fairly quickly unless something weird is going on.

But anyway, why do people here always want more tests for +5 Elo patches, but are fine with +1 Elo patches that need 200k games? Sounds kinda counter-intuitive to me: both have the same luck factor.

@MJZ1977 (Contributor) commented Jul 30, 2019

I am not surprised at all that we see such a big difference between STC and LTC. Tuning was done at LTC, so the parameters fit LTC better than STC. But to be sure we are not doing something wrong, it would be useful to do a classic [0, 3.5] LTC test.

To go further: it would be interesting to understand which parameters should vary with search depth.

@MortenLohne:

10k games at 180+1.8 seems like a very strong test to me, to see whether the changes truly scale or just happen to work well for 60+0.6.

@xoto10 (Author) commented Aug 20, 2019

@Sopel97 Agreed, I think that is interesting info at little cost. I will start stc tests for the 45k and 89k tunes.

@vondele What tc were you thinking of for 8 threads, 20+0.2 ?

@vondele (Member) commented Aug 20, 2019

yes, 20+0.2 is reasonable, IMO.

@xoto10 (Author) commented Aug 22, 2019

Nice suggestion @vondele ! The 8 thread test using values from the tune after 89k games finished with +5 Elo here
20+0.2 th 8:
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 11706 W: 1906 L: 1719 D: 8081

I would say this outweighs the -1 Elo on single thread here :
60+0.6 th 1 :
LLR: -2.95 (-2.94,2.94) [0.00,4.00]
Total: 21370 W: 3553 L: 3638 D: 14179

but obviously we would prefer no loss on 1 thread.

The tune has now got to 141k games and last time I checked (around 120k?) the bench had fallen to 25xxxxx. I will record the current values and could start another test but I don't want to run too many as fishtest is busy with various useful looking tests and obviously we will run at least one more with the final values from the tune.

Edit: One other factor, these tests were against the futility margin change by viz, but a different futility margin change has been merged into master.
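As a side note, the Elo figures quoted above can be reproduced roughly from the raw W/L/D counts with the usual logistic performance formula (ignoring error bars and draw-model details); `performance_elo` here is an illustrative helper, not fishtest code:

```python
import math

def performance_elo(wins, losses, draws):
    """Match performance in Elo from raw W/L/D counts (logistic model)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n     # mean score per game
    return -400.0 * math.log10(1.0 / score - 1.0)

# 20+0.2, 8 threads: W 1906 L 1719 D 8081 -> roughly +5.5 Elo
print(round(performance_elo(1906, 1719, 8081), 1))
# 60+0.6, 1 thread:  W 3553 L 3638 D 14179 -> roughly -1.4 Elo
print(round(performance_elo(3553, 3638, 14179), 1))
```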

@vondele (Member) commented Aug 22, 2019

@xoto10, nice result; it kind of confirms that a narrow single-threaded search scales better with lazy SMP. Since a potential +5 Elo gain at 8 threads (potentially scaling well with thread count) is quite a thing, I would try to do a few more tests towards merging into master.

One approach would be to rebase the tested branch on master, verify that the bench is still low, and do two tests:

  • a repetition of the multithreaded test (20+0.2 @ 8 threads, [0,4]) to make sure it is not a fluke and the result still holds after rebasing
  • a single-threaded LTC non-regression test (60+0.6 @ 1 thread, [-3,1]) to make sure we retain single-threaded quality

If both pass, you open a new PR with the updated values.

@snicolet (Member):

I must say that I am a bit reluctant (more than a bit reluctant, in fact) to commit the second patch with the second round of tuning (this set of values: xoto10/stockfish-xoto10@57f88bb...6fc904e) into master, because it seems like some of the values are just adding complexity over simple ideas, and introducing gratuitous random noise into the Stockfish algorithm is not good.

I am thinking in particular of these points:

• using an array of futility values instead of a linear function

• using (*contHist[1])[movedPiece][to_sq(move)] >= 8
instead of (*contHist[1])[movedPiece][to_sq(move)] >= 0

• using make_score(ct, ct * 133 / 256) for contempt instead of make_score(ct, ct / 2)

• using totBestMoveChanges = totBestMoveChanges * 118 / 256; instead of totBestMoveChanges = totBestMoveChanges / 2;

• using beta = (alpha * 123 + beta * 121) / 256; instead of beta = (alpha + beta) / 2;

etc...

Each of these changes would be worth a normal SPRT with the normal double test for introducing complexity in the code base, and the patch feels like a combo of these complexity-adding ideas :-(

To end on a positive note, the experiments suggest that we are indeed seeing a strange new effect, a difference between single-thread and multi-thread results, and I would propose to understand and then extract the core of the effect. My intuition is that the most important stuff lies in the pruning margins, as usual, so it would be interesting to take our time and re-test only with the updated margins to see if we can observe the same effects.

Note that I am completely open to heavily fine-tuning strange margins or strange factors if they are already in the Stockfish code; it is the introduction of a large set of new strange factors into the code that I want to avoid.

Stéphane

@xoto10 (Author) commented Aug 22, 2019

OK. Obviously I disagree. I understand the desire for simplicity (a mathematician likes a nice neat formula), but pi is not 3, it is 3.141592..., and using 3 is just inaccurate. We don't know what the "best" values of these parameters are, so it seems to me we should just use the values that come out of the tuner: some will matter, and those that don't will be harmless. Forcing the use of "nice" values such as 0 seems like writing an eval function that only uses even values, or multiples of 4, instead of the full resolution available; each rounding risks losing a small amount of accuracy. Changing X / C to X * A / B is just a practical matter to me: if C is small, changing it by integer amounts is too coarse a change to be realistic. If the extra operation is too expensive, then the change won't work and the test will fail anyway. If there's a gain, I don't think we should be worried by "extra complexity" like this.
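The resolution argument can be made concrete with a toy example (illustrative numbers only, not Stockfish code): stepping an integer divisor C jumps straight from ct/2 to ct/3, while stepping the numerator of an X * A / B form moves in 1/256 increments, so a tuner can express fractions like 133/256 that sit just above 1/2:

```python
# Toy illustration: integer divisors give coarse steps, A/B multipliers fine ones.
ct = 100

# Stepping the divisor: the only reachable values near ct/2 are far apart.
divisor_values = [ct // c for c in (2, 3)]                    # [50, 33]

# Stepping the numerator of a */256 form: fine-grained steps around ct/2.
frac_values = [ct * a // 256 for a in (127, 128, 129, 133)]   # [49, 50, 50, 51]

print(divisor_values, frac_values)
```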

Anyway, you're the boss, and maybe 80% of the benefit is in a few pure parameter changes, so let's investigate. I have started an 8 thread test with just pure parameter changes. The bench is 3079782 so higher than the 2747391 on the rebased patch, but perhaps this is still enough to show some threaded gains. If not, we will have to look closer at the changes missed out here. As a separate matter I will test the final output of the tune with the aim of getting a single-threaded elo gainer.

@adentong commented Aug 23, 2019

@snicolet @xoto10 While I also appreciate simplicity, I think Elo gain is Elo gain no matter what. SF is playing chess way beyond any human comprehension, so I think we should just let fishtest do its job, especially since the multithreaded test showed a very sizable Elo gain. If anything, if the Elo gain is actually as big as the first test suggested, then we can test this not as a param tweak, but with [0.5, 4.5].
Now, I do also agree that some of these values seem strange and may just be random noise. The problem is, we humans can't possibly know with certainty which of those values are noise and which are not. Perhaps some of those strange values actually contribute to the Elo gain (very likely imo), so we shouldn't simply reject them. The current test with only the param tweaks supports my viewpoint imo, since it's approaching 10k games and doing mediocre at best compared to the previous test.

@MortenLohne:

I have to agree with @snicolet here. If a patch introduces more code complexity, even something as minor as extra multiplications, it shouldn't be considered a parameter tune. If you accept that premise, then the patch violates the standard rule that every new idea should be tested independently.

In more practical terms, it is completely plausible that all the elo from this patch is due to two or three parameter tweaks, and that all the other changes suggested by SPSA are just noise. The lines pointed out by @snicolet might make no difference to sf's search at all, and will just add extra complexity and lose 0.1 elo due to the slowdown.

I think @xoto10's test with only the pure parameter tweaks is the way to go. If it passes (or doesn't), the other changes can be tested separately.

@Sopel97 (Member) commented Aug 24, 2019

Would it be considered a parameter tune if all multipliers were changed to floats? I think that's ridiculous. The slowdown is clearly not a problem.
Why not go about it the usual way: let it be, and maybe someone will simplify it later.

@MortenLohne:

This is all a matter of opinion of course, but in my opinion, no. Floating point operations are significantly more expensive than their integer counterparts, so it just isn't the same code anymore. This again violates the rule that different code changes should be tested separately whenever possible.

I see that @xoto10 is currently testing the patch both with only the simple parameter changes, and with only the added fractional parameters. These tests will be an important data point in the discussion.

@adentong commented Aug 24, 2019

@MortenLohne Actually the results are completely meaningless if we test them separately since the values were tuned together. I imagine any potential elo gain would be brought about by the two halves working together; in fact I personally wouldn't bother testing them separately. As I said, sure some of the parameters may contribute to the majority of any potential elo gain, but we can't possibly figure out which ones do. The best thing to do is to trust the process and if there's added complexity then test it as such i.e. with [0.5, 4.5].

@MortenLohne:

@adentong While that's sort of true in theory, in practice I think it's putting way too much faith in SPSA. It's not a mathematically sound optimization method to begin with (i.e. it does not converge to a maximum point), and expecting it to discover some particular combination of values that synergize perfectly after 89k games is just not realistic. Just look at all the SPSA tunes on fishtest that have appeared to converge really well, only to fail spectacularly to actually gain Elo.

Since this tune did show Elo gain, at least for a particular time control, it must have discovered something. But 90% of the Elo probably comes from 10% of the changes, and when implementing it all adds so much complexity, we should at least explore whether that complexity is actually necessary.

@Sopel97 (Member) commented Aug 24, 2019

I don't trust SPSA, but I do trust SPRT.
And it passed easily for the tuned values, failed for the first part, and is failing for the second part right now.

@NKONSTANTAKIS commented Aug 24, 2019

I think it's essential to test on 8-core LTC in order to verify that the Elo gain does not derive mainly from the short 20+0.2 TC. Also, while vondele's explanation seems solid, it could be that at short TC, where depth is important and width is covered by the many threads, heavier pruning is optimal, while at VVLTC, as the width of each thread's search tree grows, the threads' work overlaps more, losing effectiveness and risking that all of them are blind to a rare pruned move.

The tests in general show an extreme TC sensitivity, hence between two patches that perform the same at LTC, the favorable one should be the one which performs WORSE at STC (given adequate confidence). But since LTC multithreaded tests are quite expensive, an interesting idea would be to test the candidate setups at VVSTC, VSTC and STC with the intention of identifying the scaling tendency, which is probably more crucial than the STC Elo gain. The reason is that the TCs we mostly care about (i.e. TCEC TC) are many times longer than the STC, so a patch that starts at +5 Elo with a negative scaling curve would at some point regress, while a patch that starts at -5 Elo with a positive curve would bring Elo. In general I think that the testing methodology is:

a. very effective for neutral scaling, as we get combined STC+LTC confidence
b. adequate for negative scaling, as LTC safeguards us ( but we have to be careful for stuff that pass STC fast and LTC with a ton of games)
c. ineffective for positive scaling, as in this field the STC requirement lets through only the patches which bring STC Elo gain despite having positive scaling. This is obviously a vast minority of category c, and even though we get rid of unwanted positive-scaling stuff which e.g. starts at -5 and ends at -1 Elo, progress is hurt enormously.

A shy workaround for this are LTC-tuned patches, privileged to skip the STC cutoff. Raising the STC is a supported idea that IMO would help, but at a high resource cost, additive to the cost of the new STC bounds. Another idea is to specifically try to locate scaling irregularities, since this is the problematic field; this could be done cheaply with a VVSTC test combined with the STC one. So instead of 20+0.2, a combination of 10+0.1 and say 2+0.02 for every candidate patch seems to bring a lot of info at very low cost (or why not even 8+0.08 and 2+0.02 for no extra cost at all). We would have:

  1. Double the confidence (at the same cost) for neutral scaling before LTC and x1.5 in tandem.
  2. At our costly, precise (0.5, 4.5) bounds, a very promising indication of scaling irregularities that would enable us to check at LTC some stuff that fails VVSTC a lot harder than STC.

I see no drawbacks at all to this method compared to the old one if we keep the same bounds, but most probably a reconfiguration of optimal bounds would help further.

@snicolet (Member):

The funny thing is that a couple of months ago, after a long and passionate discussion, we "carefully" changed the bounds for the STC and LTC tests, choosing stricter bounds for STC and LTC so that we would have more confidence that the accepted patches would not be regressions.

And now people do speculative LTC no matter what, completely ignoring the STC results and almost claiming that it is a better patch if it fails STC!

A small contradiction, perhaps?

I will tell you: the next step will be to do speculative VLTC at 32 threads for each and every patch, and the framework will die. Of course we can slow down SF development by a factor of 20 very easily just for the fun of it, but we should realize that what we have done in ten years would have taken two hundred years if we used 20 times more resources for every test...

@xoto10 (Author) commented Aug 25, 2019

OK, there are multiple problems with the "whole" patch from this second tune: the FutilityMargin array I based it on was not merged but replaced with futility_margin(); I included some contempt changes which are badly formulated (they use Color when they should be testing the eval in previousScore); maintainers are not happy with the added multiplications; and maybe more.

So I am trying to see what we can gain from this.

Pure param changes: an obvious step is to use just the simple param changes in a normal [0,4] patch; unfortunately this failed non-regression on 1 thread badly:
60+0.6 th 1: -4.75 Elo +/- 3.7
LLR: -2.95 (-2.94,2.94) [-3.00,1.00]
Total: 11676 W: 1841 L: 2013 D: 7822
This was looking positive but not huge on 8 threads (currently -1 prio):
20+0.2 th 8: +1.35 Elo +/-2.3
LLR: 0.43 (-2.94,2.94) [0.00,4.00]
Total: 26250 W: 4163 L: 4061 D: 18026
So just the pure param changes appear to be no good.

Complementing the above, I am testing the added fractions from the tune. This is info only, since the contempt changes are badly formed and maintainers don't like adding operations in many places in one patch.
searchconst2 20+0.2 th 8:
LLR: 0.76 (-2.94,2.94) [0.50,4.50]
Total: 13539 W: 2100 L: 2011 D: 9428
Currently +2.2 Elo +/-3.2; I will likely stop or at least pause this at 15k games or so, since it is using a lot of resources and is info only.
searchconst2^ 60+0.6 th 1 [-3,1] just started. Perhaps this should have used different bounds or a fixed length to try to avoid too long a test? Any thoughts? We could reschedule as it has only just started...

@xoto10 (Author) commented Aug 25, 2019

I am also trying some individual changes from the tune, initially choosing the ones that change the bench the most, e.g. the change to reductions derived from statScore: searchconst3

Perhaps one or two of these will work as patches on their own.

@Sopel97 (Member) commented Aug 25, 2019

Maybe a test of all tuned values at STC (the set that passed on 8 threads) would tell us something. Maybe it's a complete fluke, and 8 th vs 1 th has the discrepancy only because of time control.

@Alayan-stk-2:

The funny thing is that a couple of months ago, after a long and passionate discussion, we "carefully" changed the bounds for the STC and LTC tests, choosing stricter bounds for STC and LTC so that we would have more confidence that the accepted patches would not be regressions.

And now people do speculative LTC no matter what, completely ignoring the STC results and almost claiming that it is a better patch if it fails STC!

A small contradiction, perhaps?

STC was made harder to pass than it was before, which actually increases the "need" for speculative LTC. Because it is very hard to get a green STC (there are something like 5x more patches ending yellow with over 70k games than getting a green; I didn't collect stats over a long period, but this is roughly correct), and because Elo gainers were so rare, speculative LTCs have been a common thing.

STC is supposedly there to avoid wasting resources on LTCs which are going to fail, but if there are no green STCs to run at LTC, then it's not really a waste to run speculative LTCs. A good share of the merged Elo gainers this year followed speculative LTCs.

One issue around fishtest is that flexibility is very low. The problem of STC being harder to pass, and its consequences, has been visible for months, but it's almost impossible to get people to admit the need to rethink it and validate another set of rules which would eliminate the "speculative LTC" use.

This search patch, which has shown elo-loss at fishtest STC and massive gain at LTC, also brings the question of 10+0.1 being optimal when it comes to selecting patches to test at a higher TC.

@NKONSTANTAKIS:

@snicolet For patches that score similarly at LTC, from the confidence perspective it's good to pass STC, but from the scaling perspective, the harder it fails the better. It is a very simple concept: suppose we have 100% confidence, and one patch scores -5 Elo at STC and +5 at LTC, while another scores +15 at STC and +5 at LTC. The first gains value with more time; the second loses it. These are just completely different conditions. It's natural that optimizations for depth-12 searches will not be optimal for depth-30 ones.

So the contradiction is indeed there and always has been, as we are interested both in resource economy, by cheaply filtering stuff out, and in good scaling. In the past the LTC was raised from 40" to 60" in order to be safer about decent scaling (+resource increase, +scaling, =confidence). But as we saw, there is stuff that gives +10 Elo at VLTC and scores negatively at STC! So by promoting only the stuff which passes the expensive and accurate (0.5, 4.5) STC we definitely miss a lot of gems. The recent changes were focused on getting more STC confidence while relaxing LTC requirements. Regarding scaling we are less safe than before, as we kept the same STC Elo requirement while lowering the LTC one (+resource usage, -scaling, +confidence). The initial plan anyway was to give them an evaluation period (with tuning bounds still pending), and I think it's time to reconsider and open a new discussion. I have the vision to combine our interests of resource economy and good scaling.

@xoto10 (Author) commented Aug 26, 2019

I think I'm mostly done with this second tune now; there have been a few patches, and it seems out-of-date already. The knowledge that a bench well below 3000000 is possible and gives good results on multiple threads is interesting (although maybe too brittle?); perhaps this will come in useful at some point...
I suggest the TC debate moves to fishcooking or a new issue.

Maybe a test of all tuned values at STC (the set that passed on 8 threads) would tell us something. Maybe it's a complete fluke, and 8 th vs 1 th has the discrepancy only because of time control.

@Sopel97 I have requested vary4.

mstembera pushed a commit to mstembera/Stockfish that referenced this pull request Aug 29, 2019
This is the result of a 200k tuning run at LTC:
http://tests.stockfishchess.org/tests/view/5d3576b70ebc5925cf0e9e1e

which passed quickly at LTC:
LLR: 2.95 (-2.94,2.94) [0.50,4.50]
Total: 12954 W: 2280 L: 2074 D: 8600
http://tests.stockfishchess.org/tests/view/5d3ff3f70ebc5925cf0f87a2

STC failed, but second LTC at [0,4] passed easily:
LLR: 2.96 (-2.94,2.94) [0.00,4.00]
Total: 8004 W: 1432 L: 1252 D: 5320
http://tests.stockfishchess.org/tests/view/5d407cff0ebc5925cf0f9119

Further work?
No doubt some of these changes produce most of the gain and some are neutral
or even bad, so further testing on individual/groups of parameters changed
here might show more gains. It does look like these tests might need to be
at LTC though, so maybe not too practical to do. See the thread in the pull
request for an interesting discussion:
official-stockfish#2260

Bench: 4024328
pb00068 pushed a commit to pb00068/Stockfish that referenced this pull request Sep 10, 2019
@snicolet (Member):

The discussion seems to have dried up, so I am closing the PR.

snicolet closed this Sep 14, 2019
@gonzalezjo:

Do you happen to recall (or have written down) the SPSA parameters that you used for this tune?

@vondele (Member) commented Aug 18, 2020

@Rocky640:

SPSA A=5000
SPSA Alpha = 0.602
SPSA Gamma = 0.101
numgames = 200000
TC 60+0.6
Hash=64

RM,600,0,1200,60,0.0020
FM1,175,0,350,17.5,0.0020
FM2,50,0,100,5,0.0020
RE1,512,0,1024,51.2,0.0020
RE2,1024,0,2048,102.4,0.0020
FC1,5,0,10,0.5,0.0020
FC2,1,0,2,0.1,0.0020
SB2,0,-100,100,10,0.0020
SB3,29,0,58,2.9,0.0020
SB4,138,0,276,13.8,0.0020
SB5,134,0,268,13.4,0.0020
VD1,4,0,8,0.4,0.0020
VD2,2,0,4,0.2,0.0020
VD3,1,0,2,0.1,0.0020
VD4,1,0,2,0.1,0.0020
TH1,8,0,16,0.8,0.0020
IN,229,0,458,22.9,0.0020
CT1,2,1,4,0.15,0.0020
CT2,2,1,4,0.15,0.0020
BM1,2,1,4,0.15,0.0020
DC3,2,1,4,0.15,0.0020
DC4,2,1,4,0.15,0.0020
EX4,2,1,4,0.15,0.0020
DE2,4,1,8,0.35,0.0020
IN,229,0,458,22.9,0.0020
TR,1,0,2,0.1,0.0020
AS1,5,0,10,0.5,0.0020
DE,20,0,40,2,0.0020
DC1,88,0,176,8.8,0.0020
DC2,200,0,400,20,0.0020
AB1,512,0,1024,51.2,0.0020
DE3,5,0,10,0.5,0.0020
FE1,314,0,628,31.4,0.0020
FE2,9,0,18,0.9,0.0020
FE3,581,0,1162,58.1,0.0020
TR1,10,0,20,1,0.0020
TR2,195,0,390,19.5,0.0020
TR3,100,0,200,10,0.0020
TR4,125,0,250,12.5,0.0020
TR5,225,0,450,22.5,0.0020
MC1,2,0,4,0.2,0.0020
RZ,2,0,4,0.2,0.0020
FP,7,0,14,0.7,0.0020
NM1,23200,0,46400,2320,0.0020
NM2,36,0,72,3.6,0.0020
NM3,225,0,450,22.5,0.0020
NM4,823,0,1646,82.3,0.0020
NM5,67,0,134,6.7,0.0020
NM6,200,0,400,20,0.0020
NM7,3,0,6,0.3,0.0020
NM8,12,0,24,1.2,0.0020
NM9,3,0,6,0.3,0.0020
PB1,5,0,10,0.5,0.0020
PB2,216,0,432,21.6,0.0020
PB3,48,0,96,4.8,0.0020
PB4,2,0,4,0.2,0.0020
PB5,2,0,4,0.2,0.0020
PB6,4,0,8,0.4,0.0020
II1,8,0,16,0.8,0.0020
II2,7,0,14,0.7,0.0020
EX1,8,0,16,0.8,0.0020
EX2,3,0,6,0.3,0.0020
EX3,2,0,4,0.2,0.0020
EX5,3,0,6,0.3,0.0020
EX6,39,0,78,3.9,0.0020
LM1,3,0,6,0.3,0.0020
LM2,7,0,14,0.7,0.0020
LM3,256,0,512,25.6,0.0020
LM4,200,0,400,20,0.0020
NS1,-29,-58,0,2.9,0.0020
NS2,-213,-426,0,21.3,0.0020
LM5,3,0,6,0.3,0.0020
LM6,1,0,2,0.1,0.0020
LM7,3,0,6,0.3,0.0020
LM8,2,0,4,0.2,0.0020
LM9,15,0,30,1.5,0.0020
LM10,2,0,4,0.2,0.0020
LM11,2,0,4,0.2,0.0020
LM12,4000,0,8000,400,0.0020
UC1,3,0,6,0.3,0.0020
FB,128,0,256,12.8,0.0020
EP,2,0,4,0.2,0.0020
LM13,0,-1000,1000,100,0.0020
LM14,0,-1000,1000,100,0.0020
LM15,0,-1000,1000,100,0.0020
LM16,0,-1000,1000,100,0.0020
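For readers decoding the block above: the header gives the global SPSA constants, and each parameter line appears to follow fishtest's name, start, min, max, c_end, r_end format (the final 0.0020 field being r_end). Under the standard SPSA schedule, A, alpha and gamma determine how the step size a_k and perturbation size c_k shrink over the run. A sketch follows; the relation a_end = r_end * c_end^2 is assumed from fishtest's tuner, and the c_end/r_end defaults are taken from one of the parameter lines above purely for illustration:

```python
def spsa_gain(k, n, A=5000, alpha=0.602, gamma=0.101, c_end=0.5, r_end=0.0020):
    """Step size a_k and perturbation size c_k at iteration k of an n-iteration run."""
    c = c_end * n ** gamma                       # chosen so c_k == c_end at k == n
    a = r_end * c_end ** 2 * (A + n) ** alpha    # chosen so a_k / c_k**2 == r_end at k == n
    return a / (A + k) ** alpha, c / k ** gamma

# The tune above: 200000 games, i.e. 100000 game-pair iterations.
n = 200000 // 2
a1, c1 = spsa_gain(1, n)
an, cn = spsa_gain(n, n)
print(f"start: a={a1:.4f}, c={c1:.2f}; end: a={an:.4f}, c={cn:.2f}")
```

Both sequences decay smoothly, so early iterations probe the parameter space with large perturbations and late iterations settle toward the optimum.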
