
Negative regression??? #2531

Closed
adentong opened this issue Jan 28, 2020 · 107 comments


@adentong

-2.38 ± 4.7 after about 5000 games. Obviously still very early, but after 7 Elo gainers this is not what I was expecting...

@joergoster
Contributor

Look at the number of games taken for most of the passed elo gainers. Totally unreliable, imho.

@Alayan-stk-2

Alayan-stk-2 commented Jan 28, 2020

SPRT Elo estimates are not accurate, but the confidence of superiority is very good. SPRT is designed to stop once there is very high confidence that the patch doesn't regress and high confidence that it progresses.

The regression test is far from finished, but at the very best it will end at something like +1 or +2 Elo. That's bad news.

@NKONSTANTAKIS

It would be valuable to repeat the RT with just the Elo gainers after SF11 (master minus simplifications and non-functional patches). In the past a couple of tests of this kind showed no problem, but that doesn't mean it will stay that way forever. We now also have different bounds and statistics.

@NKONSTANTAKIS

NKONSTANTAKIS commented Jan 28, 2020

Another interesting RT would be with the noob3 book, as the optimization we do for it might not correlate that strongly with 8-move-book performance. If that's the case, it's probably a good idea to do RTs on noob3, for better tracking of progress, more accuracy, and less panicking.

@NKONSTANTAKIS

Another thing to examine is the {0, 2} LTC bounds possibly being too easy, allowing too many false positives through the sheer number of tests. We basically rely on a single test; the current STC adds very little confidence at LTC.

@Sopel97
Member

Sopel97 commented Jan 29, 2020

With current settings, the probability of a <0 Elo patch passing LTC is 5%, according to https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=0&elo-1=2&draw-ratio=0.61&rms-bias=0. NKONSTANTAKIS made a good point about the book being different. If it is indeed the book making the difference, then it's gonna be fun...
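The 5% quoted above is essentially Wald's classical approximation for SPRT error rates: fishtest stops a test once the log-likelihood ratio leaves (-2.94, 2.94). This is only an illustrative sketch; fishtest's generalized SPRT differs in detail.

```python
import math

def wald_error_rates(lower=-2.94, upper=2.94):
    # Wald's classical SPRT approximation: with LLR stopping thresholds
    # (lower, upper), alpha is the chance of accepting H1 when the true
    # Elo equals elo0, beta the chance of accepting H0 at elo1.
    A, B = math.exp(upper), math.exp(lower)
    alpha = (1 - B) / (A - B)
    beta = B * (A - 1) / (A - B)
    return alpha, beta

alpha, beta = wald_error_rates()
print(round(alpha, 3), round(beta, 3))  # both ~0.05
```

With the symmetric ±2.94 thresholds both error rates come out at about 5%, matching the calculator's figure for a patch sitting exactly at elo0.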

@Rocky640

Rocky640 commented Jan 29, 2020

If we only look at the 15 tests between SF11 and 200128,
three things stand out, and they happened in the most recent 5 tests committed:

a) bench jumped from 4725546 (test no 14) to 5545845 (test no 15) on 200128

b) 2 LTC finished before 10000 games (test no 12 and test no 14)

c) test no 11 (http://tests.stockfishchess.org/html/live_elo.html?5e30ab0bab2d69d58394fdf1) is currently struggling, although it already passed an STC regression test

I therefore suggest to test SF11 against test no 10, which is
6d0eabd

and then we can continue bisecting up or down from there.

@Rocky640

Rocky640 commented Jan 29, 2020

https://www.sp-cc.de/ tested the official Stockfish 11 (200118) against a pool of AB engines and compared it with Stockfish 200107 playing against the same pool, at 180+1.

He says "AB-testrun of Stockfish 11 finished. Sadly a clear regression to Stockfish 200107 (-7 Elo)"

This is only one result, but there is a puzzling coincidence:
the first tests using the new pentanomial model, 3-move book, and new SPRT bounds were committed on January 7. Maybe the problem started there?

@Alayan-stk-2

The error bars with 5K games against significantly weaker opponents do not allow us to conclude with a reasonable degree of confidence that SF11 is inferior to SF 200107.

@31m059

31m059 commented Jan 29, 2020

@Rocky640 That is possible, but it would certainly be surprising... our own regression testing showed a slight single-threaded gain from January 7 to SF11. While that is close to within error bars, we should have detected a 7-Elo regression...

@adentong
Author

adentong commented Jan 29, 2020

Should we temporarily disallow new patch submissions to fishtest and stop all currently running tests, so we can focus our attention on this issue? We'd probably want to run a bunch of regression tests to pinpoint the problem. This could have some pretty big consequences for the future of SF's improvement.

@ddugovic

The fishtest server has a priority field (which could be increased for regression tests, if those are a priority).

@vondele
Member

vondele commented Jan 29, 2020

let's look at this carefully, but without panic...

first, the results so far (RT not yet finished) are consistent with NCM https://nextchessmove.com/dev-builds, where no single patch stands out.

second, as a next step I will test with the book which is now being used for development, out of curiosity to see if this matters.

@31m059

31m059 commented Jan 29, 2020

I don't think it will be necessary to stop all other ongoing tests, or to elevate priority...those are drastic measures. We can use throughput for a "softer" approach.

But since the new throughput calculation prefers positive LLR, @snicolet's non-regression verification test is going to lose workers as it progresses towards LLR = -2.94 (the more informative result, since failure may mean reversion of that commit). Therefore, I've artificially raised its throughput to 150% for now. Hopefully, this represents a good compromise...

@vondele
Member

vondele commented Jan 29, 2020

@31m059 no need... failure of that test will not necessarily imply revert. First, I think starting those tests was premature, second, let's not forget the statistical nature of fishtest.

@31m059

31m059 commented Jan 29, 2020

@vondele My apologies, I've now restored the default throughput. I agree with your approach of exercising restraint here.

@Vizvezdenec
Contributor

Okay, I didn't notice this topic, I will repost here :)

@Vizvezdenec
Contributor

Vizvezdenec commented Jan 29, 2020

I think it's about time to respin this discussion after a quite disappointing regression test (it's not finished, but it's quite obvious that it most likely won't finish positive).
So, we made the STC bounds really loose, and now the probability of a negative patch passing both tests became 18% * 5%, so about 0.9%. It seems that this is too much: 7 Elo gainers result in what seems to be a slightly negative Elo gain.
I guess we should do something about this.

Most obvious: we should probably run simplification attempts for all 7 passed patches that made it into master since the SF11 release, probably just at LTC.
It seems that 0.9% is too high a chance for a negative patch to pass. Since we want loose STC bounds to give more patches a shot at LTC, we should probably slightly tighten the LTC bounds themselves, since a lot of negative patches go to LTC and each of them has a decent chance to pass.
My proposal is to change the LTC SPRT bounds to {0.25, 2.25} or {0.5, 2.5}; the second is closer to the 0.25% regression chance we have had for a long time, the first would allow more patches to pass.
I think that the stronger the engine gets, the stricter the non-regression requirement should be (yes, it's sad, because fewer patches will pass): the percentage of passed patches becomes lower and lower, thus more and more patches are tested, and more and more of them lie in the "slightly negative" zone.
Maybe we can also slightly move the lower bound of STC.
I think a good compromise between everything can be
STC {-0.5, 3}, LTC {0.25, 2.25}.
The chance of a negative patch passing would be something like 0.26%, more or less what we used for years; we would have slightly fewer LTCs (which is, imho, a good thing: nowadays we run endless LTCs, most of them not even close to passing), the overall game number wouldn't increase that much, and the STC-LTC correlation would be slightly more reliable.
I guess that's all from me for now, your opinion is really appreciated :)
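The percentages being thrown around in this thread all come from the SPRT model. A simplified trinomial sketch of it (fishtest actually works on a pentanomial model over game pairs, so this is only illustrative; the {-0.5, 3} default below is the STC proposal from this comment):

```python
import math

def wdl_from_elo(elo, draw_ratio=0.61):
    # Logistic Elo -> (win, draw, loss) probabilities, with the draw
    # ratio held fixed (a simplification of fishtest's pentanomial model).
    score = 1 / (1 + 10 ** (-elo / 400))
    return (score - draw_ratio / 2, draw_ratio, 1 - score - draw_ratio / 2)

def llr_increment(result, elo0=-0.5, elo1=3.0, draw_ratio=0.61):
    # Per-game log-likelihood-ratio step for SPRT(elo0, elo1):
    # log P(result | elo1) / P(result | elo0), result in {'w', 'd', 'l'}.
    idx = {'w': 0, 'd': 1, 'l': 2}[result]
    p0 = wdl_from_elo(elo0, draw_ratio)
    p1 = wdl_from_elo(elo1, draw_ratio)
    return math.log(p1[idx] / p0[idx])
```

A test passes once the accumulated LLR reaches +2.94 and fails at -2.94. Note that with a fixed draw ratio, draws contribute nothing to the LLR; that lost information is one reason the real framework moved to the pentanomial model.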

@pb00068
Contributor

pb00068 commented Jan 29, 2020

@Vizvezdenec I appreciate your proposition, but IMO it's too early to take measures. Let us first try to understand (and measure) what went wrong.

@Vizvezdenec
Contributor

Well, the first thing to do is try to simplify away the passed patches with {-1.5, 0.5}, imho.

@xoto10
Contributor

xoto10 commented Jan 29, 2020 via email

@NKONSTANTAKIS

The regression cannot have come from the LTC gainers. Even if we hit the 5% "jackpot" on 3 of the 7, the rest would cover up the tiny Elo loss, landing us at +1-2 at worst.

Nevertheless, STC {-0.5, 3} definitely agreed, plus 15" for extra correlation. LTC may then be fine as is (viz's suggestion sounds good too, but it might make greens too rare), so STC is the first needed step, then reassess.

More on topic:
Simplifications with the current {-1.5, 0.5} feel like they pass more easily than before, and one has to question their value in general. Imo if they are not removing serious code it's not worth it. For example, we now have a visible OCB weakness (from CCC observation), because the cycle of a small regression into a new, better formula didn't succeed (yet).

A large number of non-functional patches, untested, holds the risk of random unforeseen side-effects, as seems to be the case here. It's probably wise to do "better safe than sorry" tests, as sn often does, since neither humans nor compilers can be 100% trusted.

@Vizvezdenec
Contributor

Vizvezdenec commented Jan 29, 2020

Oh, again simplifications are the culprit, of course.
I've heard this a lot of times and have never seen a test of 20 simplifications together come out negative: it was done multiple times and NEVER showed a result worse than -1 Elo.
But "Elo gaining tests" actually being negative is nothing unheard of even with the old bounds, and it's even more probable with the current ones.
I'm just saying that, with the number of patches we test and the percentage of patches passing, we should actually decrease the chance that a passing patch is negative.
Nowadays the 0.25% pass threshold sits at -0.3 Elo and the 0.1% threshold at -0.5 Elo; with the previous bounds we had this at 0 Elo (0.5 0.5 and 0.5;4.5 and 0;3.5 respectively).
With our goal to be better at 60+0.6, and mostly relying on this test as the indication of a patch being good, imho it's pretty logical to make the lower bound > 0 so that LTC regressions have a really low chance to pass, even at the cost of fewer patches passing in general.

@owldw21

owldw21 commented Jan 29, 2020

Imo starting an RT with only the simplifications wouldn't hurt too much, and it could determine whether the SPRT bounds are too loose.

@vdbergh
Contributor

vdbergh commented Jan 29, 2020

I think we should wait for the tests to finish before discussing anything. Note that from the test by @vondele it may well follow that we are just witnessing a case of selection bias. This could be a confirmation of the existence of such a phenomenon (on which I have speculated in many posts).

@Vizvezdenec
Contributor

Well, opening dependency can certainly be a thing. But tbh it's within error bounds.
Also I need to say: if the LTC on the 2-move book ends at 0, it's still pretty bad.

@NKONSTANTAKIS

NKONSTANTAKIS commented Jan 29, 2020

Removing too many simplifications at once makes for an extremely noisy test, as in between a lot of Elo gainers were tuned on the basis of those simplifications. Turning them all back on at once would distort the Elo gainers' performance unpredictably. 20 is a ridiculous number, a useless test. I don't understand the rationale for suspecting just the 1% chance of Elo gainers regressing while at the same time {-1.5, 0.5} means that a -0.5 Elo test is more or less a coin flip, so 25% to pass both and be merged.

Edit: An interesting usage of a new master - 20 simpl vs master test would be as a comparison with an old master + 20 simpl. This way one can measure their relative dependency over that period.

@NKONSTANTAKIS

Error 2 Elo, book 2 Elo, linux patch 2 Elo, simplifications 2 Elo: sum 8 Elo, hence -3 instead of +5. Easy.

Case 1. Is unavoidable
2. Easily eliminated with a same-book RT
3. Requires constant caution for untested changes
4. Elo is much harder to get now than it was a decade ago when {-3, 1} was established. We can adapt our price-to-simplification ratio to current needs, for example with {-1.25, 0.75}

@adentong
Author

So from the regression tests it doesn't look like the book is the problem. We also know that the linux large pages commit failed to pass its non-regression test.

Vizvezdenec referenced this issue in Vizvezdenec/Stockfish Jan 29, 2020
bench 5024857
@Vizvezdenec
Contributor

Eh, I don't have time to do it now. Can probably do it somewhere near 10 hours from now :)

@ddobbelaere
Contributor

Doesn't it also make more sense to take [-1, 1] for simplifications?

In the proposition of @vondele, a +1 Elo patch has a 50% chance of passing and its revert a 50% chance of being simplified away.

@vondele
Member

vondele commented Jan 31, 2020

@ddobbelaere not quite right: a simplification that removes a 1 Elo feature has only about a ~4% chance to pass both STC and LTC (quick check, correct me if wrong).
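A quick check of this kind can be sketched with a Monte Carlo simulation of the SPRT, under the simplified trinomial fixed-draw-ratio model (not fishtest's actual pentanomial code, so the numbers are only ballpark):

```python
import numpy as np

def wdl(elo, draw_ratio):
    # Win/draw/loss probabilities under a logistic Elo curve with a
    # fixed draw ratio (a rough stand-in for fishtest's pentanomial model).
    s = 1 / (1 + 10 ** (-elo / 400))
    return np.array([s - draw_ratio / 2, draw_ratio, 1 - s - draw_ratio / 2])

def sprt_pass_prob(true_elo, elo0, elo1, draw_ratio=0.61, trials=300,
                   max_games=100_000, chunk=2_000, seed=42):
    # Monte Carlo estimate of P(SPRT accepts H1) at a given true Elo.
    rng = np.random.default_rng(seed)
    steps = np.log(wdl(elo1, draw_ratio) / wdl(elo0, draw_ratio))  # LLR per W/D/L
    p = wdl(true_elo, draw_ratio)
    llr = np.zeros(trials)
    alive = np.ones(trials, dtype=bool)
    passed = np.zeros(trials, dtype=bool)
    for _ in range(max_games // chunk):
        n = int(alive.sum())
        if n == 0:
            break
        # Play `chunk` games for every undecided trial, then check the
        # bounds (checking only at chunk boundaries is an approximation).
        outcomes = rng.choice(3, size=(n, chunk), p=p)
        llr[alive] += steps[outcomes].sum(axis=1)
        hi = alive & (llr >= 2.94)
        lo = alive & (llr <= -2.94)
        passed |= hi
        alive &= ~(hi | lo)
    return passed.mean()  # trials undecided at the cap count as failures

# A simplification removing a ~1 Elo feature, tested with {-1.5, 0.5}:
p_single = sprt_pass_prob(-1.0, -1.5, 0.5)
print(p_single, p_single ** 2)
```

Under these assumptions a single test passes in the high-teens percent range, so passing both STC and LTC lands in the low single digits, the same ballpark as the ~4% quoted.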

@vdbergh
Contributor

vdbergh commented Jan 31, 2020

@ddobbelaere There is already an issue that a neutral patch has a relatively high probability of failing a simplification test. SPRT {-1, 1} would make that worse.

The probability of simplifying away a 1 Elo patch is far less than 50%.

@Vizvezdenec
Contributor

I'd prefer STC to be something like {-0.25, 1.75}:
0.75 Elo passes 50% of the time, a negative patch overall would have something like a 0.18% chance of passing, and the upper bound being the same looks aesthetically more right :)

@ddobbelaere
Contributor

ddobbelaere commented Jan 31, 2020

@vondele @vdbergh You're right, sorry.

My statement holds for a +0.5 Elo patch (0.5 lies in the middle of [-0.5, 1.5], and -0.5 lies in the middle of [-1.5, 0.5], so 50% chance of passing STC in both cases). Actually, with the proposed bounds it's still less likely that a +0.5 Elo patch passes both STC and LTC (the LTC bounds being higher than STC) than that a -0.5 Elo simplification passes both.

Anyway, I think the additional 'constraint' that most simplifications don't just revert earlier patches but actually make the code more readable/smaller is helping us here.

@xoto10
Contributor

xoto10 commented Jan 31, 2020

If we focus on -0.5 Elo patches, my understanding is that the numbers change from a 10% STC pass rate and a 1.2% LTC pass rate now, to 5% and 0.3%; is that correct?
The change sounds reasonable on that basis, but it doesn't feel like only 1.2% of our -0.5 Elo patches have been passing LTC. Is the SPRT calculator underestimating this somehow? If it is, is this change likely to be good anyway, or can we refine it?

@xoto10
Contributor

xoto10 commented Jan 31, 2020 via email

vondele added a commit that referenced this issue Jan 31, 2020
Revert 5 patches which were merged but led to a regression test that showed negative Elo gain:

http://tests.stockfishchess.org/tests/view/5e307251ab2d69d58394fdb9

This was discussed in depth in:

#2531

Each patch was removed and tested as a simplification, full list below, and the whole combo as well.

After the revert the regression test showed a neutral result:

http://tests.stockfishchess.org/tests/view/5e334851708b13464ceea33c

As a result of this experience, the SPRT testing bounds will be made more strict.

Reverted patches:

1 Dynamic Complexity 6d0eabd :

STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31fcacec661e2e6a340d08 :
LLR: 2.97 (-2.94,2.94) {-1.50,0.50}
Total: 38130 W: 7326 L: 7189 D: 23615
Ptnml(0-2): 677, 4346, 8843, 4545, 646

LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c18fec661e2e6a340d73 :
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 38675 W: 4941 L: 4866 D: 28868
Ptnml(0-2): 270, 3556, 11429, 3584, 291

3 More bonus for bestMoves on past PV nodes 71e0b53 :

STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31fe93ec661e2e6a340d10 :
LLR: 2.95 (-2.94,2.94) {-1.50,0.50}
Total: 46100 W: 8853 L: 8727 D: 28520
Ptnml(0-2): 796, 5297, 10749, 5387, 813

LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c187ec661e2e6a340d71 :
LLR: 2.96 (-2.94,2.94) {-1.50,0.50}
Total: 16920 W: 2161 L: 2055 D: 12704
Ptnml(0-2): 115, 1498, 5006, 1569, 130

4 Tweak Restricted Piece Bonus 0ae0045 :

STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31fefaec661e2e6a340d15 :
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 88328 W: 17060 L: 16997 D: 54271
Ptnml(0-2): 1536, 10446, 20169, 10422, 1581

LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c17aec661e2e6a340d6f :
LLR: 2.95 (-2.94,2.94) {-1.50,0.50}
Total: 34784 W: 4551 L: 4466 D: 25767
Ptnml(0-2): 255, 3279, 10061, 3345, 262

5 History update for pruned captures 01b6088 :

STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31ff5eec661e2e6a340d1a :
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 29541 W: 5735 L: 5588 D: 18218
Ptnml(0-2): 483, 3445, 6820, 3469, 545

LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c196ec661e2e6a340d75 :
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 22177 W: 2854 L: 2757 D: 16566
Ptnml(0-2): 143, 2005, 6555, 2055, 164

6 Tweak trapped rook penalty 18fc21e :

STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31ffb1ec661e2e6a340d1c :
LLR: 2.95 (-2.94,2.94) {-1.50,0.50}
Total: 24476 W: 4727 L: 4569 D: 15180
Ptnml(0-2): 390, 2834, 5659, 2933, 417

LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c19eec661e2e6a340d77 :
LLR: 2.95 (-2.94,2.94) {-1.50,0.50}
Total: 97332 W: 12492 L: 12466 D: 72374
Ptnml(0-2): 690, 9107, 28738, 9034, 720

All 5 as one simplification :
LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e334098708b13464ceea330 :
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 7829 W: 1079 L: 964 D: 5786
Ptnml(0-2): 52, 690, 2281, 781, 65

Bench: 5153165
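As a sanity check, the logistic Elo estimate behind result lines like those above can be reproduced from the W/L/D totals (a hypothetical helper, not fishtest code; fishtest's own display also uses the pentanomial counts for the error bars):

```python
import math

def elo_from_wld(wins, losses, draws):
    # Logistic Elo estimate from raw win/loss/draw totals.
    games = wins + losses + draws
    score = (wins + draws / 2) / games
    return -400 * math.log10(1 / score - 1)

# "All 5 as one simplification" LTC above: W 1079, L 964, D 5786.
print(round(elo_from_wld(1079, 964, 5786), 1))  # ~5.1, i.e. the revert gained Elo
```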
@Vizvezdenec
Contributor

@vondele so what about the new bounds?
I'm starting tests with the ones you proposed for now; I want to really get back to writing something, not debugging ;)

@vondele
Member

vondele commented Jan 31, 2020

preparing a fishtest commit right now...

vondele added a commit to vondele/fishtest that referenced this issue Jan 31, 2020
as per discussion in official-stockfish/Stockfish#2531

introduce new bounds:

standard STC {-0.5, 1.5} -> 50% pass rate at 0.5 Elo (100k games), 1% pass rate at -1.0 Elo
standard LTC {0.25, 1.75} -> 50% pass rate at 1.0 Elo (137k games), 0.3% pass rate at -0.5 Elo
simplification STC/LTC {-1.5, 0.5} unmodified.
@Vizvezdenec
Contributor

Okay, I just started some tests with the adjusted bounds (selecting them manually).

ppigazzini pushed a commit to official-stockfish/fishtest that referenced this issue Jan 31, 2020
as per discussion in official-stockfish/Stockfish#2531

introduce new bounds:

standard STC {-0.5, 1.5} -> 50% pass rate at 0.5 Elo (100k games), 1% pass rate at -1.0 Elo
standard LTC {0.25, 1.75} -> 50% pass rate at 1.0 Elo (137k games), 0.3% pass rate at -0.5 Elo
simplification STC/LTC {-1.5, 0.5} unmodified.
@vondele
Member

vondele commented Jan 31, 2020

The regressing patches have been reverted (6ccb1ca) and the new SPRT bounds introduced (official-stockfish/fishtest@cc84d9c).

This issue has been resolved. I'm sure the process has been useful, and I hope the new bounds help us make progress quickly and in a robust fashion.

@mstembera
Contributor

#2538

@MJZ1977
Contributor

MJZ1977 commented Feb 21, 2020

I am back after some months of absence, so sorry to react with a lot of delay.

This is just my personal feeling about this huge change in SPRT bounds :

  • the new bounds are better and much more robust than the older ones,
  • as always, the drawback is the long test queue, with tests of 50k+ games.

So I would kindly suggest a small adjustment: [-0.5, 2] for STC. This will significantly decrease the number of games for ~0 Elo patches which are not potential big improvements to SF. The drawback is a small increase in +1 Elo failures, but after all this should be acceptable if we concentrate our efforts and resources on better patches.

@vondele
Member

vondele commented Feb 21, 2020

I'd like to keep the bounds like this for a while to check... in the past days, the number of tests running has been relatively modest, and overall throughput still OK. The bounds are such that 'patches that scale with TC' have a reasonable chance to make STC and perform at LTC.

@31m059

31m059 commented Feb 26, 2020

@vondele For what it's worth, I agree with @MJZ1977. While it might seem true that a modest number of test submissions can support more precise bounds, the problem is that since developers are not blind to the currently running tests, the number of tests running and the number of tests submitted are not independent. The more tests are running, the fewer tests are submitted.

I know that I often deliberately forgo submitting tests, or reduce the number of variations I submit, when I see the framework is under heavy load (out of courtesy to other developers). I'm sure I'm not the only one who does this...

@vondele
Member

vondele commented Feb 26, 2020

Clearly, there is a correlation between running tests and tests submitted. That's natural and good, I think. On a nearly empty framework, there is lots of random stuff, and countless variations being run... that's not necessarily good. Right now, running time of a test requiring 100k STC games is still <24h.

I went through the list of tests that have run since the introduction of the new bounds... if I counted correctly, we had a total of 750 tests running, 50 were Elo-gainer LTCs, and 3 passed {0.25, 1.75}. So about 6% promote from STC to LTC, and about 6% of Elo-gainer LTCs are successful.

If we increase the current STC bounds from {-0.5, 1.5} to {-0.5, 2.0}, fewer tests will pass STC, and it will be more difficult for 'good scalers' to pass testing. Is that what we want?

@MJZ1977
Contributor

MJZ1977 commented Feb 26, 2020

I don't think that changing from {-0.5, 1.5} to {-0.5, 2.0} would dramatically kill 'good scalers'. Of the last 50 patches, how many would have been lost if we had slightly changed these bounds?

Now if we had run 1500 tests instead of 750, my feeling is that we would find more successful Elo gainers at the end. And of course we are not at the extreme of recent months, with STC tests finishing in 20 minutes on an empty framework.

We can notice that between an STC finishing in 20 minutes and one finishing in 20 hours there is a big gap :-)

@31m059

31m059 commented Feb 26, 2020

@vondele

If we increase the current STC bounds from {-0.5, 1.5} to {-0.5, 2.0}, fewer tests will pass STC, and it will be more difficult for 'good scalers' to pass testing. Is that what we want?

(emphasis added)

I don't think this is correct, though. A smaller proportion of tests passes, but this is offset by an increased number of tests. Counter-intuitively, we should expect more STC greens if we make the bounds stricter (up to the point where developers can no longer submit good tests fast enough to use the added framework capacity).

Please allow me to clarify what I mean:

Current STC bounds: [-.5, 1.5]
Expected number of games for a neutral (+0) test: 86,900
Expected pass rate of +0.75 Elo tests: 67.6%
Expected pass rate of +1 Elo tests: 81.3%

Modified STC bounds: [-.5, 2]
Expected number of games for a neutral (+0) test: 52,300
Expected pass rate of +0.75 Elo tests: 50%
Expected pass rate of +1 Elo tests: 64.3%

The capacity of the framework occupied by running +0 tests would be 1 - 52300 / 86900 = 40% lower.

The pass rate for +0.75 Elo tests would only be 1 - 50/67.6 = 26% lower.
The pass rate for +1.00 Elo tests would only be 1 - 64.3/81.3 = 21% lower.

So the increase in capacity for new tests outpaces the decrease in pass rate. Assuming the distribution of submitted test Elo doesn't change (which is a potentially flawed but unavoidable assumption), we would expect 23% more +0.75 tests to pass, and 32% more +1.00 tests, unless I've made an error in the calculations.
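The arithmetic above is easy to check (the games-and-pass-rate inputs are taken from the comment itself, as SPRT-calculator outputs):

```python
# Capacity freed for neutral (+0) tests by moving STC {-0.5, 1.5} -> {-0.5, 2.0}:
capacity_gain = 86_900 / 52_300                # ~1.66x neutral tests per unit time
print(f"{1 - 52_300 / 86_900:.0%} fewer games per neutral test")

# Net change in expected passes, assuming the submitted-Elo mix is unchanged:
for elo, old_rate, new_rate in [(0.75, 0.676, 0.50), (1.00, 0.813, 0.643)]:
    net = capacity_gain * new_rate / old_rate - 1
    print(f"+{elo} Elo patches: {net:+.0%}")
```

This reproduces the 40%, +23%, and roughly +32% figures (the +1.00 Elo case computes to +31.4%, matching the comment up to rounding).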

@NKONSTANTAKIS

NKONSTANTAKIS commented Feb 27, 2020

Assuming neutral scaling, we currently harvest just 40.5% (81% * 50%) of +1 Elo patches.
With the new proposal this would go down to 32%.

I think at the current state strong Elo gainers are almost non-existent, and Elo gainers in general are too rare. Personally I feel sad when a parameter tweak or a modest code addition fails at a high LTC game count. Usually it represents a harmless, clear improvement which fell a bit short, with tons of resources invested. Even setting aside the game-count indication, the +0.5 to +0.8 Elo range obviously has very low passing chances. Especially when a solid idea in about that Elo range is tried in multiple versions, it consumes extreme resources.

I think that an important question is:
If we knew the exact elo gain of a patch what would be the requirement?

I assume around 0.5 Elo. Obviously, due to the lack of perfect confidence, the bounds also serve the purpose of making it less likely for useless or regressing code to sneak in; hence the LTC mid-point target of +1 Elo, with very high confidence.

I still don't understand why we use the same bounds for tweaks and code adders. When we are willing to use considerable resources and pay a little Elo for cleaner and less code via simplifications, shouldn't that be reflected in reverse?
Are a 0 Elo tweak and a 0 Elo code adder equally bad when fluking in?
What about a +0.2 patch? I can't think why someone would prefer the slightly inferior parameters, but sub-par code certainly hurts long-term.

The way I see it there are 2 safely profitable possibilities:

  1. We could economize considerable resources by exploiting the harmless nature of tweaks through lower confidence requirements.
  2. We could enable continuous small-Elo micro-optimization with a statistical model of average parameter-tweak acceptance.

Elaborating on 2: this is an effort to find the most efficient way to be, for example, +2-3 Elo after 10 tweaks, bypassing accurate individual performance measurement. Currently this is scarcely attempted, with combos, but that method is clearly inefficient because first the normal bounds are tried, and then the combo selections are chosen on the shaky basis of SPRT game count.
My proposal is basically to combo the profitable elements straight into master.

@MJZ1977
Contributor

MJZ1977 commented Feb 27, 2020

The logic of @31m059 applies even more strongly for +2 Elo and +3 Elo patches...

Edit: for me, we must avoid both extremes:

  • quick tests with low robustness (10k games) and an often empty framework
  • heavy tests with a long queue and people discouraged from launching new tests

Of course everything depends on the available cores in the framework. If we have 5k cores, [-0.5, 1.5] is clearly the best choice.
