Negative regression??? #2531
Comments
|
Look at the number of games taken for most of the passed elo gainers. Totally unreliable, imho. |
|
SPRT elo estimates are not accurate, but the confidence of superiority is very good. It's designed to stop once there is a very high confidence it doesn't regress and a high confidence it progresses. The regression test is far from finished, but at the very best it will be something like +1 or +2 elo. That's bad news. |
|
It would be valuable to repeat the RT with just the Elo gainers after SF11 (master minus the simplifications and non-functional patches). In the past a couple of tests of this kind showed no problem, but that doesn't mean it will stay that way forever. We also now have different bounds and statistics. |
|
Another interesting RT would be with the noob3 book, as the optimization we do for it might not correlate that strongly with 8-moves performance. If that's the case, it's probably a good idea to do RTs on noob3, for better tracking of progress, more accuracy, and less panicking. |
|
Another thing to examine is whether the {0, 2} LTC is too easy, allowing too many false positives through the sheer number of tests. We basically rely on a single test; the current STC adds very little LTC confidence. |
|
With current settings, the probability of a <0 Elo patch passing LTC is 5%, according to https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=0&elo-1=2&draw-ratio=0.61&rms-bias=0. NKONSTANTAKIS made a good point about the book being different. If it is indeed the book making the difference, then it's gonna be fun... |
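That single-test pass probability is easy to sanity-check by simulation. Below is a minimal Monte Carlo sketch of an SPRT run; the trinomial (win/draw/loss) game model with a fixed 0.61 draw ratio and the simplified Gaussian LLR are assumptions, not fishtest's exact pentanomial implementation, so treat the numbers as ballpark only:

```python
# Monte Carlo sketch of an SPRT test. Assumptions: independent games,
# trinomial model with fixed draw ratio, simplified Gaussian LLR with
# estimated variance. Not fishtest's exact pentanomial implementation.
import math
import numpy as np

def elo_to_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_passes(true_elo, elo0, elo1, draw_ratio=0.61,
                max_games=200_000, batch=2_000, rng=None):
    """Simulate one SPRT run; True if the upper LLR bound is hit first."""
    rng = rng or np.random.default_rng()
    bound = math.log(0.95 / 0.05)        # ~2.94 for alpha = beta = 0.05
    p = elo_to_score(true_elo)
    w = p - draw_ratio / 2.0             # P(win); P(loss) = 1 - w - draw_ratio
    p0, p1 = elo_to_score(elo0), elo_to_score(elo1)
    wins = draws = losses = 0
    while wins + draws + losses < max_games:
        r = rng.random(batch)
        bw = int(np.count_nonzero(r < w))
        bl = int(np.count_nonzero(r >= w + draw_ratio))
        wins, losses, draws = wins + bw, losses + bl, draws + (batch - bw - bl)
        n = wins + draws + losses
        xbar = (wins + 0.5 * draws) / n  # observed score
        var = (wins + 0.25 * draws) / n - xbar * xbar
        llr = n * (p1 - p0) * (2 * xbar - p0 - p1) / (2 * var)
        if llr >= bound:
            return True
        if llr <= -bound:
            return False
    return False                         # hit the game cap: count as a fail

def pass_rate(true_elo, elo0, elo1, runs=300, seed=42):
    rng = np.random.default_rng(seed)
    return sum(sprt_passes(true_elo, elo0, elo1, rng=rng) for _ in range(runs)) / runs

r_zero = pass_rate(0.0, 0.0, 2.0)  # 0-Elo patch vs {0, 2}: should be rare (~5%)
r_two = pass_rate(2.0, 0.0, 2.0)   # true +2 Elo patch: should pass nearly always
print(r_zero, r_two)
```

With 300 runs per estimate the sampling error is a couple of percentage points, but the ~5% false-positive rate for a 0-Elo patch quoted above should be clearly visible.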
|
If we only look at the 15 tests between SF11 and 200128: a) the bench jumped from 4725546 (test no 14) to 5545845 (test no 15) on 200128; b) 2 LTC finished before 10000 games (test no 12 and test no 14); c) http://tests.stockfishchess.org/html/live_elo.html?5e30ab0bab2d69d58394fdf1. I therefore suggest testing SF11 against test no 10, and then we can continue bisecting up or down from there. |
|
https://www.sp-cc.de/ tested the official Stockfish 11 (200118) against a pool of AB engines and compared it with Stockfish 200107 playing against the same pool, at 180+1. He says: "AB-testrun of Stockfish 11 finished. Sadly a clear regression to Stockfish 200107 (-7 Elo)". This is only one result, but there is a puzzling coincidence: |
|
The error bars with 5K games against significantly weaker opponents do not allow to tell with a reasonable degree of confidence that SF11 would be inferior to SF 200107. |
|
@Rocky640 That is possible, but certainly would be surprising...our own regression testing showed a slight single-threaded gain from January 7 to SF11. While that is close-to-within error bars, we should have detected a 7-Elo regression... |
|
Should we temporarily disallow any new patch submissions to fishtest and stop all currently running ones, so we can focus our attention on this issue, since we'd probably want to run a bunch of regression tests to pinpoint the problem? Seems like this could have some pretty big consequences on the future of SF's improvement. |
|
The fishtest server has a priority field (which could be increased for regression tests, if those are a priority). |
|
let's look at this carefully, but without panic... first, the results so far (RT not yet finished) are consistent with NCM https://nextchessmove.com/dev-builds where no single patch stands out. second, as the next thing, I will test with the book which is now being used for development, out of curiosity to see if this matters. |
|
I don't think it will be necessary to stop all other ongoing tests, or to elevate priority...those are drastic measures. We can use throughput for a "softer" approach. But since the new throughput calculation prefers positive LLR, @snicolet's non-regression verification test is going to lose workers as it progresses towards LLR = -2.94 (the more informative result, since failure may mean reversion of that commit). Therefore, I've artificially raised its throughput to 150% for now. Hopefully, this represents a good compromise... |
|
@31m059 no need... failure of that test will not necessarily imply revert. First, I think starting those tests was premature, second, let's not forget the statistical nature of fishtest. |
|
@vondele My apologies, I've now restored the default throughput. I agree with your approach of exercising restraint here. |
|
okay, I didn't notice this topic, I will repost there :) |
|
I think it's about time to revive this discussion after a quite disappointing regression test (it's not finished, but it's quite obvious that it most likely won't finish positive). The most obvious step is that we should probably attempt simplifications of all 7 passed patches that made it into master since the SF11 release. Probably just at LTC. |
|
@Vizvezdenec I appreciate your proposition, but IMO it's too early to take measures. Let us first try to understand (and measure) what went wrong. |
|
Well, the first thing that should be done is trying to simplify the passed tests with {-1.5, 0.5}, imho. |
|
Trying to simplify the passed patches (but maybe not all at once!) seems like a good idea. Alternatively, with 15 patches since SF11, how about testing the first 8 patches for progression / regression? If there is a single patch causing a problem, this would be a first step towards identifying it.
|
|
It is not possible for the regression to have come from the LTC gainers. Even if we hit the 5% "jackpot" on 3 of the 7, the rest would cover up the tiny Elo loss, landing us at +1-2 at worst. Nevertheless, STC {-0.5, 3} is definitely agreed, plus 15" for extra correlation. LTC may then be fine as is (viz's suggestion sounds good too, but it might make greens too rare), so STC is the first needed step; then reassess. More on topic: a large number of non-functional patches, untested, carry the risk of random unforeseen side effects, as seems to be the case here. It's probably wise to do "better safe than sorry" tests like sn often does, as neither humans nor compilers can be 100% trusted. |
|
Oh, again the simplifications are the culprit, of course. |
|
imo starting a RT with only simplifications wouldn't hurt too much, and can determine if the SPRT bounds are too loose. |
|
I think we should wait for the tests to finish before discussing anything. Note that from the test by @vondele it may well follow that we are just witnessing a case of selection bias. This could be a confirmation of the existence of such a phenomenon (on which I have speculated in many posts). |
|
Well, opening dependency can certainly be a thing. But tbh it's within error bounds. |
|
Removing too many simplifications at once makes for an extremely noisy test, as in between a lot of Elo gainers were tuned on the basis of those simplifications. Turning them all back on at once would distort the Elo gainers' performance unpredictably. 20 is a ridiculous number, a useless test. I don't understand the rationale for suspecting just the 1% chance of Elo gainers regressing, while at the same time {-1.5, 0.5} means that a -0.5 Elo test is more or less a coin flip, so 25% to pass both and be merged. Edit: An interesting usage of a new master - 20 simplifications vs master test would be as a comparison with an old master + 20 simplifications. This way one can measure their relative dependency over that period. |
|
Error 2 elo, book 2 elo, linux patch 2 elo, simpl 2 elo, sum 8 elo, hence -3 instead of +5. Easy. |
|
So from the regression tests it doesn't look like book is the problem. We also know that the linux large page commit failed to pass non regression. |
|
Eh, I don't have time to do it now. Can probably do it somewhere near 10 hours from now :) |
|
Doesn't it also make more sense to take [-1, 1] for simplifications? In the proposition of @vondele, a +1 Elo patch has a 50% chance of passing and its revert a 50% chance of being simplified away. |
|
@ddobbelaere not quite right: a simplification that removes a 1 Elo feature has only a ~4% chance to pass both STC and LTC (quick check, correct me if wrong). |
|
@ddobbelaere There is already an issue that a neutral patch has a relatively high probability of failing a simplification test. SPRT {-1,1} would make that worse. The probability of simplifying away a 1 Elo patch is far less than 50%. |
|
I'd prefer the STC to be something like {-0.25, 1.75} |
|
@vondele @vdbergh You're right, sorry. My statement holds for a +0.5 Elo patch (0.5 lies in the middle of [-0.5, 1.5], and -0.5 lies in the middle of [-1.5, 0.5], so there is a 50% chance of passing STC). Actually, with the proposed bounds it's still less likely that a +0.5 Elo patch passes both STC and LTC (with higher bounds than STC) than that a -0.5 Elo simplification passes both STC and LTC. Anyway, I think the additional 'constraint' that most simplifications don't just revert earlier patches but actually make the code more readable/smaller is helping us here. |
|
If we focus on -0.5 Elo patches, my understanding is the numbers change from 10% STC pass and 1.2% LTC pass now, to 5% and 0.3%, is that correct? Edit: "have been passing LTC" |
|
Do we continue trying to write new patches for Stockfish? Or should we halt development until we've made changes? I currently have a test pending, but I'm not sure myself... Try some simplifications?
|
Revert 5 patches which were merged, but led to a regression test that showed negative Elo gain: http://tests.stockfishchess.org/tests/view/5e307251ab2d69d58394fdb9

This was discussed in depth in #2531. Each patch was removed and tested as a simplification (full list below), and the whole combo as well. After the revert, the regression test showed a neutral result: http://tests.stockfishchess.org/tests/view/5e334851708b13464ceea33c

As a result of this experience, the SPRT testing bounds will be made more strict.

Reverted patches:

1 Dynamic Complexity 6d0eabd
STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31fcacec661e2e6a340d08 : LLR: 2.97 (-2.94,2.94) {-1.50,0.50} Total: 38130 W: 7326 L: 7189 D: 23615 Ptnml(0-2): 677, 4346, 8843, 4545, 646
LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c18fec661e2e6a340d73 : LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 38675 W: 4941 L: 4866 D: 28868 Ptnml(0-2): 270, 3556, 11429, 3584, 291

3 More bonus for bestMoves on past PV nodes 71e0b53
STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31fe93ec661e2e6a340d10 : LLR: 2.95 (-2.94,2.94) {-1.50,0.50} Total: 46100 W: 8853 L: 8727 D: 28520 Ptnml(0-2): 796, 5297, 10749, 5387, 813
LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c187ec661e2e6a340d71 : LLR: 2.96 (-2.94,2.94) {-1.50,0.50} Total: 16920 W: 2161 L: 2055 D: 12704 Ptnml(0-2): 115, 1498, 5006, 1569, 130

4 Tweak Restricted Piece Bonus 0ae0045
STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31fefaec661e2e6a340d15 : LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 88328 W: 17060 L: 16997 D: 54271 Ptnml(0-2): 1536, 10446, 20169, 10422, 1581
LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c17aec661e2e6a340d6f : LLR: 2.95 (-2.94,2.94) {-1.50,0.50} Total: 34784 W: 4551 L: 4466 D: 25767 Ptnml(0-2): 255, 3279, 10061, 3345, 262

5 History update for pruned captures 01b6088
STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31ff5eec661e2e6a340d1a : LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 29541 W: 5735 L: 5588 D: 18218 Ptnml(0-2): 483, 3445, 6820, 3469, 545
LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c196ec661e2e6a340d75 : LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 22177 W: 2854 L: 2757 D: 16566 Ptnml(0-2): 143, 2005, 6555, 2055, 164

6 Tweak trapped rook penalty 18fc21e
STC 10+0.1 https://tests.stockfishchess.org/tests/view/5e31ffb1ec661e2e6a340d1c : LLR: 2.95 (-2.94,2.94) {-1.50,0.50} Total: 24476 W: 4727 L: 4569 D: 15180 Ptnml(0-2): 390, 2834, 5659, 2933, 417
LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e32c19eec661e2e6a340d77 : LLR: 2.95 (-2.94,2.94) {-1.50,0.50} Total: 97332 W: 12492 L: 12466 D: 72374 Ptnml(0-2): 690, 9107, 28738, 9034, 720

All 5 as one simplification:
LTC 60+0.6 https://tests.stockfishchess.org/tests/view/5e334098708b13464ceea330 : LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 7829 W: 1079 L: 964 D: 5786 Ptnml(0-2): 52, 690, 2281, 781, 65

Bench: 5153165
|
@vondele so what about new bounds? |
|
preparing a fishtest commit right now... |
as per discussion in official-stockfish/Stockfish#2531, introduce new bounds:
standard STC {-0.5, 1.5} -> 50% pass rate at 0.5 Elo (100k games), 1% pass rate at -1.0 Elo
standard LTC {0.25, 1.75} -> 50% pass rate at 1.0 Elo (137k games), 0.3% pass rate at -0.5 Elo
simplification STC/LTC {-1.5, 0.5} unmodified.
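For what it's worth, the pass rates quoted in the commit message can be reproduced with Wald's classical SPRT approximation. Modeling the LLR as Brownian motion with symmetric stopping bounds +-b, b = ln(0.95/0.05) ~ 2.94, the pass probability collapses to a logistic function of lam = b * (2E - elo0 - elo1) / (elo1 - elo0), independent of the draw ratio. This is a sketch under simplifying assumptions (linear Elo-to-score map near 0, no LLR overshoot at the bounds):

```python
import math

def sprt_pass_prob(true_elo, elo0, elo1, alpha=0.05):
    """Wald/Brownian approximation of the probability that a patch of
    strength true_elo passes SPRT with bounds {elo0, elo1}. Assumes a
    linear Elo-to-score map and ignores LLR overshoot at the bounds."""
    b = math.log((1 - alpha) / alpha)        # stopping bound, ~2.94
    lam = b * (2 * true_elo - elo0 - elo1) / (elo1 - elo0)
    return 1.0 / (1.0 + math.exp(-lam))      # logistic in lam

# The new standard bounds announced above:
print(round(sprt_pass_prob( 0.5, -0.50, 1.50), 3))  # STC at +0.5 Elo -> 0.5
print(round(sprt_pass_prob(-1.0, -0.50, 1.50), 3))  # STC at -1.0 Elo -> 0.012
print(round(sprt_pass_prob( 1.0,  0.25, 1.75), 3))  # LTC at +1.0 Elo -> 0.5
print(round(sprt_pass_prob(-0.5,  0.25, 1.75), 3))  # LTC at -0.5 Elo -> 0.003
```

The same formula gives ~18.7% per stage for a -1 Elo patch under the simplification bounds {-1.5, 0.5}, i.e. ~3.5% for passing STC and LTC combined, consistent with the ~4% estimate earlier in the thread.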
|
Okay, I just started some tests with adjusted bounds (selecting them by hand). |
|
The regressing patches have been reverted (6ccb1ca) and the new SPRT bounds introduced (official-stockfish/fishtest@cc84d9c). This issue has been resolved. I'm sure the process has been useful, and I hope the new bounds help us make progress quickly and in a robust fashion. |
|
I am back after some months of absence, so sorry for reacting with a lot of delay. This is just my personal feeling about this huge change in SPRT bounds:
So I will kindly suggest a small adjustment: [-0.5, 2] for STC. This will significantly decrease the number of games spent on ~0 Elo patches, which are not potentially big improvements to SF. The drawback is a small increase in +1 Elo failures, but after all this should be acceptable if we concentrate our efforts and resources on better patches. |
|
I'd like to keep the bounds like this for a while to check... in the past days, the number of tests running has been relatively modest, and overall throughput still OK. The bounds are such that 'patches that scale with TC' have a reasonable chance to make STC and perform at LTC. |
|
@vondele For what it's worth, I agree with @MJZ1977. While it might seem true that a modest number of test submissions can support more precise bounds, the problem is that since developers are not blind to the currently running tests, the number of tests running and the number of tests submitted are not independent. The more tests are running, the fewer tests are submitted. I know that I often deliberately forgo submitting tests, or reduce the number of variations I submit, when I see the framework is under heavy load (out of courtesy to other developers). I'm sure I'm not the only one who does this... |
|
Clearly, there is a correlation between tests running and tests submitted. That's natural and good, I think. On a nearly empty framework there is lots of random stuff and countless variations being run... that's not necessarily good. Right now, the running time of a test requiring 100k STC games is still <24h. I went through the list of tests that have run since the introduction of the new bounds... if I counted correctly, we had a total of 750 tests running, 50 were Elo-gainer LTCs, and 3 passed {0.25, 1.75}. So about 6% promote from STC to LTC, and about 6% of Elo-gainer LTCs are successful. If we increase the current STC bounds from {-0.5, 1.5} to {-0.5, 2.0}, fewer tests will pass STC, and it will be more difficult for 'good scalers' to pass testing. Is that what we want? |
|
I don't think that changing from {-0.5, 1.5} to {-0.5, 2.0} will dramatically kill 'good scalers'. Of the last 50 patches, how many would have been lost because we slightly changed these bounds? Now if we had run 1500 tests instead of 750, my feeling is that we should find more successful Elo gainers at the end. And of course we are not at the extreme of recent months, with STC tests finishing in 20 minutes on an empty framework. We can notice that from an STC finishing in 20 min to an STC finishing in 20 hours there is a big gap :-) |
(emphasis added) I don't think this is correct, though. A smaller proportion of tests pass, but this is offset by an increased number of tests. Counter-intuitively, we should expect more STC greens if we make the bounds stricter (up to the point that developers no longer can submit good tests fast enough to meet the added framework capacity). Please allow me to clarify what I mean: Current STC bounds: [-.5, 1.5] Modified STC bounds: [-.5, 2] The capacity of the framework occupied by running +0 tests would be 1 - 52300 / 86900 = 40% lower. The pass rate for +0.75 Elo tests would only be 1 - 50/67.6 = 26% lower. So the increase in capacity for new tests outpaces the decrease in pass rate. Assuming the distribution of submitted test Elo doesn't change (which is a potentially flawed but unavoidable assumption), we would expect 23% more +0.75 tests to pass, and 32% more +1.00 tests, unless I've made an error in the calculations. |
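The capacity arithmetic can be checked with a Brownian-motion view of the SPRT, using Wald's approximations for the pass probability and the expected number of games, E[N] = b(2P - 1)/mu (with the zero-drift limit b^2/v). The per-game variance model below, a trinomial with draw ratio 0.61, is an assumption chosen to match the SPRT-calculator settings used earlier in the thread:

```python
import math

B = math.log(0.95 / 0.05)    # LLR stopping bound, ~2.94
S = math.log(10) / 1600      # score change per Elo, near equal strength
VAR = (1 - 0.61) / 4         # per-game score variance at draw ratio 0.61

def pass_prob_and_games(true_elo, elo0, elo1):
    """Wald approximations for the SPRT pass probability and the
    expected number of games (assumption: linear Elo-to-score map)."""
    lam = B * (2 * true_elo - elo0 - elo1) / (elo1 - elo0)
    p = 1.0 / (1.0 + math.exp(-lam))                  # pass probability
    dp = (elo1 - elo0) * S                            # score gap of the hypotheses
    drift = dp * (2 * true_elo - elo0 - elo1) * S / (2 * VAR)  # E[LLR] per game
    if abs(drift) < 1e-12:
        games = B * B * VAR / (dp * dp)               # zero-drift limit b^2 / v
    else:
        games = B * (2 * p - 1) / drift
    return p, games

# A 0-Elo patch under current vs proposed STC bounds:
for bounds in [(-0.5, 1.5), (-0.5, 2.0)]:
    p, n = pass_prob_and_games(0.0, *bounds)
    print(bounds, round(p, 3), round(n))  # ~0.187 / ~86900 games, ~0.146 / ~52300

# Pass rate of a +0.75 Elo patch under the two bound sets:
print(round(pass_prob_and_games(0.75, -0.5, 1.5)[0], 3))   # 0.676
print(round(pass_prob_and_games(0.75, -0.5, 2.0)[0], 3))   # 0.5
```

Under these assumptions this reproduces both the 52300/86900 game counts and the 50/67.6 pass rates used in the comparison above.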
|
Assuming neutral scaling, we currently harvest just 40.5% (81% * 50%) of +1 Elo patches. I think in the current state strong Elo gainers are almost nonexistent, and Elo gainers in general are too rare. Personally, I feel sad when a parameter tweak or a modest code addition fails at a high LTC game count. Usually it represents a harmless, clear improvement which fell a bit short, with tons of resources invested. Even skipping the game-count indication, the +0.5 to +0.8 Elo range obviously has very low passing chances. Especially when a solid idea of about that Elo range is tried in multiple versions, it consumes extreme resources. I think that an important question is: I assume around 0.5 Elo. Obviously, due to the lack of perfect confidence, the bounds also serve the purpose of making it less likely for useless or regressing code to sneak in. Hence the LTC mid-point target of +1 Elo, with very high confidence. I still don't understand why we use the same bounds for tweaks and code additions. When we are willing to use considerable resources and pay a little Elo for cleaner and less code via simplifications, shouldn't that be reflected in reverse? The way I see it, there are 2 safely profitable possibilities:
Elaborating on 2: it is an effort to find the most efficient way to be, for example, +2-3 Elo after 10 tweaks, bypassing accurate individual performance measurement. Currently this is scarcely attempted with combos, but this method is clearly inefficient because first normal bounds are tried, and then the combo selections are chosen on the shaky basis of SPRT game counts. |
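The 40.5% harvest figure is just the product of the two stages. A short check using Wald's Brownian approximation of the SPRT pass probability (a sketch under the assumptions of a linear Elo-to-score map and neutral scaling, not fishtest's exact model):

```python
import math

def sprt_pass_prob(true_elo, elo0, elo1, b=math.log(19)):
    """Wald/Brownian approximation of the SPRT pass probability
    (assumes a linear Elo-to-score map, ignores LLR overshoot)."""
    lam = b * (2 * true_elo - elo0 - elo1) / (elo1 - elo0)
    return 1.0 / (1.0 + math.exp(-lam))

stc = sprt_pass_prob(1.0, -0.5, 1.5)    # +1 Elo patch at STC {-0.5, 1.5}
ltc = sprt_pass_prob(1.0, 0.25, 1.75)   # +1 Elo is the midpoint of the LTC bounds
print(round(stc, 2), round(ltc, 2), round(stc * ltc, 3))   # 0.81 0.5 0.407
```

i.e. roughly the 40.5% (81% x 50%) quoted above.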
|
The logic of @31m059 applies even more strongly for +2 Elo and +3 Elo patches... Edit: for me, we must avoid extremes.
Of course everything depends on the available cores in the framework. If we have 5k cores, [-0.5, 1.5] is clearly the best choice. |
-2.38 +-4.7 after about 5000 games. Obviously still very early, but after 7 Elo gainers this is not what I was expecting...