
Gating or no gating? The simulation advocates for gating with a 50% threshold (vs 55% currently) #1524

Friday9i opened this Issue Jun 5, 2018 · 70 comments

Friday9i commented Jun 5, 2018

Please find attached an Excel file simulating the impact of gating vs no gating, to try to understand the most efficient approach.
Latest version of the Excel file (18/6/2018):
Gating v4.xlsx.gz

More exactly, it is an approximate simulation of the result of gating depending on the threshold used: is the famous 55% optimal or not? Or should we use another threshold?
Conclusion: it seems a threshold around 50% is the most efficient approach, giving an average Elo boost of ~16.5 points vs only ~13 points with 55% gating (ie a ~25% boost!).
More generally, values between 45% and 53% seem quite close to the optimum (around 16 Elo points).
All in all, this seems to advocate for adjusting the current approach by changing the threshold from 55% to 50% (which is almost equivalent to no gating with the Elo not allowed to decrease by more than 30 points).
A positive side effect: it would generate significantly more new networks, possibly giving more diversity to the selfplay games, which should help learning ; -). And less frustration from 'good' networks failing to get a PASS ; -)!
A negative side effect: we would get more new networks (which is a bit painful to handle) and not all of them would be stronger.

Note: the fact that some new networks are weaker is taken into account in that analysis, and it’s not a problem as long as the threshold chosen is above ~45%.

The Excel file lets you test thresholds between ~45% and ~55% : -)
How does it work?
First, beware: each time you change something or press F9, all the 'random()' cells are re-initialized and the tests change, giving different results… (and it takes a few seconds to compute).
There are 1000 columns to simulate 1000 tests simultaneously and get significant results : -):

  • Line 7: the win probability of a new network against the current best network. I chose to spread it evenly between 40% and 60%, which seems reasonably fair given the history of networks tested (I ignore nets with a probability below 40%, as they will never get a PASS anyway, so we don't care! And we have never seen nets above 60%, so those must be negligible).
  • Line 8: a copy of line 7; this is the line used to run the various tests and adjust the threshold (so the net probabilities do not change each time F9 is pressed). If you want new values, just copy and paste line 7 onto line 8!
  • Lines 9 to 438: the cumulative score of the tested network over a 430-game test match, ie 430 games of the new net vs the current net (+1 when the new net wins, +0 otherwise).
  • Lines 440 to 869: the win rate of the new net after n match games. Hence, line 869 gives the result of the tested net after 430 games (for each of the 1000 nets tested in columns B to ALM). But the PASS test is not simply applied after 430 games, it is more dynamic; hence the lines below.
  • Lines 874 to 1303: a simplified PASS test. I do not look at results before 300 games. Beyond 400 games, if the win rate reaches the threshold, the net gets a PASS. And from 300 to 400 games, the PASS threshold is dynamic (starting from the chosen threshold +2%, ie 55%+2%=57% at game 300 currently, and dropping to 55% at game 400).
  • Line 1304: the result of the (simplified) PASS test! If it is 1, the net gets a PASS.

From there:

  • The selected networks are visible on line 9 (from line 1304)
  • We can compute the true ELO gain of these selected networks (knowing their “true strength”): info on line 4. Note: networks not selected do not change ELO, of course!
  • Cell B2: the average ELO gain from these 1000 tests!

Each time you press F9, a new test is run with 1000 new matches. And if you copy and paste line 7 onto line 8, 1000 new networks are simulated.
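For those who prefer code to spreadsheets, here is a rough Python sketch of the same kind of Monte Carlo experiment (an approximation of the spreadsheet's logic, not the spreadsheet itself); it assumes true win probabilities spread uniformly over 40%-60%, 430-game matches, and the simplified dynamic PASS rule described above:

```python
# Hypothetical re-implementation of the spreadsheet's experiment in Python.
# Assumptions: uniform 40%-60% true win probabilities, 430 games per match,
# and the simplified PASS rule above (no SPRT early stopping).
import numpy as np

def elo_from_p(p):
    """Elo difference corresponding to a win probability p."""
    return -400.0 * np.log10(1.0 / p - 1.0)

def passes(results, threshold):
    """Simplified PASS test: ignore the first 300 games; from game 300 to 400
    the bar slides from threshold+2% down to threshold, then stays there."""
    rate = np.cumsum(results) / np.arange(1, len(results) + 1)
    for g in range(300, len(results) + 1):
        bar = threshold + 0.02 * max(0, 400 - g) / 100.0
        if rate[g - 1] >= bar:
            return True
    return False

def average_gain(threshold, n_nets=1000, n_games=430, seed=0):
    """Average true Elo gain per tested net (a rejected net contributes 0)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_nets):
        p = rng.uniform(0.40, 0.60)        # true strength of the candidate
        games = rng.random(n_games) < p    # simulated match results
        if passes(games, threshold):
            total += elo_from_p(p)
    return total / n_nets

for t in (0.45, 0.50, 0.52, 0.55):
    print(f"threshold {t:.0%}: ~{average_gain(t):.1f} Elo gained per tested net")
```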

What do you think of that?

Note: I know there is a bias… The bias is the following: changing the selection process changes the training data, which should impact the strength of newly submitted networks. But how??? There is no way to simulate it, so I ignore that bias.
Gating.zip

l1t1 commented Jun 5, 2018

If games = 429, what win rate would be needed to pass? @Friday9i

Friday9i commented Jun 5, 2018

400 or 430 games doesn't change things significantly, 0.1% or 0.2% probably; I didn't focus on that (and the simulated PASS test is also an approximation, as it's not that important here).
But a 50% vs 55% threshold changes a lot of things; that's what I simulated.
(time to sleep for me, cu tomorrow)

killerducky commented Jun 5, 2018

I chose to spread it evenly between 40% and 60%, which seems reasonably fair given the history of networks tested

I don't think this matches the distribution we actually have. Maybe you can scrape the match history table. I think more nets are below 50% than above. So if you gate at 50%, there will be more false positives than false negatives. In an extreme case, if 90% of your nets have a true winrate of 49% and 10% a true winrate of 51%, gating at 50% is probably going to actually lose strength over time.
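(That extreme case is easy to check numerically; the sketch below assumes a fixed 400-game match and promotion whenever the observed win rate reaches 50%, which is a simplification of the real SPRT stopping rule.)

```python
# Quick check of the 90%-at-49% / 10%-at-51% scenario above, assuming a fixed
# 400-game match and promotion iff at least 200 games are won.
from math import log10
from scipy.stats import binom

def elo(p):
    return -400 * log10(1 / p - 1)

n = 400
accept_49 = binom.sf(199, n, 0.49)   # P(>= 200 wins | true winrate 49%)
accept_51 = binom.sf(199, n, 0.51)   # P(>= 200 wins | true winrate 51%)
gain = 0.9 * accept_49 * elo(0.49) + 0.1 * accept_51 * elo(0.51)
print(f"expected Elo change per tested net: {gain:+.2f}")
# A negative number here would mean 50% gating loses strength in this scenario.
```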

I think the main advantage of gating is to have time to gain confidence that your training pipeline is doing the right thing. If you can't get a 55% net after a long time, you have time to look at your hyperparams and try improving them. The math you did doesn't really account for this advantage.

Friday9i commented Jun 6, 2018

@killerducky It's easy to run tests with other distributions of win rates from the file, I'll do that tonight. True, it may give different results, but the distribution you describe is not reflected in the history of nets tested and is hence extremely unlikely, while a smooth distribution between 40% and 60% seems reasonable… ; -). Anyway, I'll test some other distributions and compare the results.

gcp commented Jun 6, 2018

I ignore nets with a probability below 40%, as they will never get a PASS anyway, so we don’t care!

I really can't see any sense in discarding those. Why not look at the real, actual distribution of net results (but note here that the scores from the matches we have are biased because of the SPRT termination; notably, the worse a network performs, the earlier its match is cut short)? What could you possibly achieve by filtering those, other than invalidating your results?

If you are simulating uniformly distributed networks around 40%-60%, then I would expect any gating that is >50% to make forward progress. I can't really see how you could get any other result?

The SPRT and gating we have is very strongly oriented towards avoiding false positives while ignoring false negatives (because we expect strength to keep increasing, and the next network to pass), which is pretty clear because the condition is [0, 35] and not symmetrical around 0. The underlying assumption is that more networks are expected to be worse than better (contrary to the uniform 40-60% assumption). I'm not sure that is currently true, especially because SWA has stabilized things a lot, and because obviously always-pass worked for Alpha Zero. On the other hand, Leela Zero Chess hasn't really had undivided success with that method.
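For reference, a rough sketch of an SPRT of that shape is below; it assumes the standard binomial log-likelihood-ratio form with Elo bounds [0, 35] and alpha = beta = 0.05, and is only illustrative, not the actual server code.

```python
# Illustrative SPRT with H0: +0 Elo, H1: +35 Elo, alpha = beta = 0.05.
# Not the actual server implementation, just the standard binomial LLR form.
from math import log

def p_from_elo(elo):
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt(wins, losses, elo0=0.0, elo1=35.0, alpha=0.05, beta=0.05):
    p0, p1 = p_from_elo(elo0), p_from_elo(elo1)
    llr = wins * log(p1 / p0) + losses * log((1 - p1) / (1 - p0))
    lower, upper = log(beta / (1 - alpha)), log((1 - beta) / alpha)
    if llr >= upper:
        return "PASS"
    if llr <= lower:
        return "fail"
    return "continue"

# Note that p_from_elo(35) is about 0.55, which lines up with the 55% threshold
# discussed in this issue.
print(sprt(233, 171))   # e.g. the d0187996 vs 10bc1042 result quoted later in the thread
```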

herazul commented Jun 6, 2018

Leela Zero Chess hasn't really had undivided success with that method.

Neither has Minigo.
That leads me to a question for those who follow the AI scene closely: has DeepMind released anything more than their pre-print of AlphaZero? Or announced a date?

john45678 commented Jun 6, 2018

I think both Leela Chess and Minigo would be better with gating. Watching both, there seems to be an up-and-down 'flow' in the (self-play) ratings between models and it's very hard to know which model is best (and progression seems difficult with Minigo).

stephenmartindale commented Jun 6, 2018

Since I already have a database of scraped match data which contains results for the last 288 matches, I thought I'd take a look at this. Here are some histograms that show that the win/loss rate is certainly not uniformly distributed and not really normally distributed around 0.5, either.

[two histograms of the scraped match win rates]

The vertical lines are at 0.5 and 0.55, respectively. Note that I did include matches that were cut short by SPRT and so the data will be biased as @gcp mentioned, above. Additionally, I made no effort to filter out irrelevant matches such as those with the 20-block networks or the ELF networks.

Interestingly, although the data does appear to be skewed, it isn't nearly as dramatically skewed as I expected. To give an exact number: 148 of the 288 matches were below 50%. I expected more.

Friday9i commented Jun 6, 2018

@stephenmartindale thanks a lot, I'll update the simulation with these data, which I didn't have before. Great!

Ishinoshita commented Jun 6, 2018

While this discussion certainly has merit and may help speed up LZ progress, and while I like LZ's Elo graph much better than LCZ's, I note a kind of paradox.
The LCZ project exhibits a flattening, lately even plunging, self-Elo trend but advertises a 'stay calm' message (roughly: 'there is plenty of evidence that the bad days are behind us and that LCZ is making progress again, confirmed by matches against other reference programs, test suites, etc.').
On the contrary, LZ has a nice Elo graph showing steady progress. But, over roughly the last 500k selfplay games, there has been a +400 self-Elo gain that does not seem to translate into playing strength increase vs external references: few gains against ELF 5417b (probably less than +100 Elo) and few gains, if any, measurable for Petgo3 on KGS (somewhat clamped just below 8 dan since ~mid April). Do we have any other extrinsic reference to gauge LZ's recent progress?

gcp commented Jun 6, 2018

But, over roughly the last 500k selfplay games, there has been a +400 self-Elo gain that does not seem to translate into playing strength increase vs external references.

Score against ELF went from 5% to 14% or so?

It's hard to compare against other things due to lack of references (not a problem in chess...). CGOS is not so good, KGS is even worse...

gcp commented Jun 6, 2018

has DeepMind released anything more than their pre-print of AlphaZero? Or announced a date?

Their complete paper got hung up in peer review, but that will not last indefinitely.

I am not sure if you will find a magic explanation there, aside from "we tried a number of things until it worked".

PhilipFRipper commented Jun 6, 2018

If we want to experiment with these kinds of things, shouldn't we do so with much smaller networks first? I know not all things may scale to larger networks, but the cost of testing is much lower.

herazul commented Jun 6, 2018

I am not sure if you will find a magic explanation there, aside from "we tried a number of things until it worked".

Yeah, maybe, but I can't help thinking that something is not quite right with Minigo or Leela Chess, like something is missing or certain parameters are not right for the job. But yeah, maybe we will get no answer whatsoever.

Did anyone manage to reproduce AlphaZero? Like getting the same amazing results that a lot of people got from the AlphaGo and AlphaGo Zero papers? I don't think I have seen anyone manage it, even among all the recent Go AIs that popped up. It seems almost all of them are AGZ-like (or some are AG Master-like).
I could be wrong but it's pretty interesting.

If we want to experiment with these kinds of things, shouldn't we do so with much smaller networks first? I know not all things may scale to larger networks, but the cost of testing is much lower.

I think AlphaZero-like no-gating already worked with small nets on a 9x9 board but failed to scale to 19x19, so it might not be conclusive (Minigo for example achieved pro level on 9x9 but failed several times on 19x19, if I remember correctly).

gcp commented Jun 6, 2018

The Alpha Zero setup for sure is a bit more brittle than the Alpha Go Zero one, so it stands to reason that small differences could suddenly end up fatal. The game generation speed vs training speed for example implicitly controls over-fitting.

If the paper gets released, the most interesting part is going to be if they documented what did not work.

herazul commented Jun 6, 2018

if they documented what did not work.

I think it would be too good to be true.

Contrary to certain people, I find it pretty good and healthy that they did not release any AlphaGo weights.
However they could at least, following their papers, have answered publicly (as a kind of public follow-up) the questions that a lot of people have about their methods and what failed and why, especially since there are parts of their papers that are not very clear, hard to understand, and prone to interpretation.

Ishinoshita commented Jun 6, 2018

@gcp Agree. I was just misled by the graph showing ELF flying more or less 350-450 Elo above the network, like chasing a moving target, the gap being only slowly reduced over time and not in proportion to the +400 self-Elo gain over the last 500k-game period.

Friday9i commented Jun 6, 2018

@stephenmartindale, would it be possible to provide an Excel file with the info?
@gcp, filtering networks below ~40% has no impact at all on the results… Very bad networks will not get a PASS even if lucky, so whatever the selected threshold (45%, 50% or 55%), they will not be promoted. Hence, they have no impact on the Elo gains and can simply be ignored.

PS: my PC crashed badly, probably a dead power supply; new tests will have to wait for a few days, damn :-(.

TFiFiE commented Jun 7, 2018

But, over roughly the last 500k selfplay games, there has been a +400 self-Elo gain that does not seem to translate into playing strength increase vs external references.

Score against ELF went from 5% to 14% or so?

That would suggest a self-play inflation factor of about 2, which notably closely matches the suggested formula to adjust the self-play ratings of Leela Chess Zero (x*0.6-645).
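Rough arithmetic behind that factor, using the 5% and 14% ELF scores quoted above and the usual logistic Elo conversion:

```python
# Back-of-the-envelope: convert the 5% -> 14% score against ELF into Elo
# and compare with the ~+400 self-play Elo gained over the same period.
from math import log10

def elo_gap(score):
    return -400 * log10(1 / score - 1)

before, after = elo_gap(0.05), elo_gap(0.14)
print(f"gap to ELF: {before:.0f} -> {after:.0f} Elo "
      f"(real gain ~{after - before:.0f} Elo vs ~400 self-play Elo)")
```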

stephenmartindale commented Jun 7, 2018

Ah. Sorry, @Friday9i. I did realise that you would probably need the raw data and I posted a link to the iPython code and attached the Sqlite database but accidentally did so on the wrong GitHub issue.

To avoid uploading the attachment again, here it is: #1526 (comment)

Friday9i commented Jun 7, 2018

BREAKING NEWS: I updated the simulation with the true (estimated) net probabilities from the match data, and it gives comparable results :-)
With a 55% gating, the average gain is ~7.6 Elo per new net, while a 50% gating gives an average gain of ~12 Elo per new net, which is a much faster improvement!
Note: the Elo improvement is different from the one I got with the basic flat distribution, which is logical (I previously calculated the average Elo gain while excluding "very bad" nets, so the average is of course lower when you take them into account), but all in all it confirms, with real historical data, that gating with a 50% threshold is more efficient than gating with a 55% threshold.
I chose to exclude nets with very high win rates, as those come from larger nets and do not reflect the "everyday" test matches between same-size nets.
File with calculations attached. Its structure is the same; I just added a few lines at the top of the 'gating' sheet (where you can find the results), and I added a new 'Distribution' sheet to compute the net win rate from a random number between 0 and 1 based on the real distribution of match win rates (data from @stephenmartindale, thanks again!).

Gating v2.zip

Friday9i commented Jun 7, 2018

Additional results:
55% gating, average Elo gain: ~7.6 Elo per new net
52% gating, average Elo gain: ~11.3 Elo per new net
50% gating, average Elo gain: ~12.0 Elo per new net
48% gating, average Elo gain: ~11.3 Elo per new net
And instead of an average promotion rate of ~20% for new nets (with the current 55% gating), about half of the tested nets would have been promoted on average with a 50% gating (from the beginning).

gjm11 commented Jun 7, 2018

I think your calculations assume that the distribution of new:old ELO difference doesn't depend on what promotion policy led to the promotion of the old network, but surely that isn't true. And I bet the actual biggest way in which changing promotion policy might change LZ's rate of improvement is via the effect on the network's training of feeding it games from networks whose strength varies in different ways; my guess is that more permissive promotion would probably improve that, but it's very much not obvious.

herazul commented Jun 7, 2018

And I bet the actual biggest way in which changing promotion policy might change LZ's rate of improvement is via the effect on the network's training of feeding it games from networks whose strength varies in different ways

Yeah, it's pretty much the whole debate: would lowering the gating improve the training through increased diversity of games, or make it worse because of the lowered quality of games?
And the problem with setting it too low or removing it might be what happened to Minigo or Leela Chess: not only stagnating but getting worse and worse, with very big Elo drops after a few bad cycles.

Friday9i commented Jun 7, 2018

@gjm11 Indeed, that's exactly the point of the note at the end of the opening post above ; -)
Note: I know there is a bias… The bias is the following: changing the selection process changes the training data, which should impact the strength of newly submitted networks. But how??? There is no way to simulate it, so I ignore that bias.

And I also share your view that more frequent new networks would increase the diversity of selfplay games, which should improve training! Still, that would change the distribution of new nets' strength, which will have an impact that I cannot simulate (as we don't know the future distribution we'll get!).

gjm11 commented Jun 7, 2018

The thing is, I think the thing you're ignoring (for the excellent reason that no one understands it well enough to simulate it) is probably the main thing that matters.

Friday9i commented Jun 7, 2018

@gjm11: yes and no, IMHO.
It matters a lot indeed, but the results seem to be solid and qualitatively unchanged across 2 very different distributions of strength (of tested networks vs the current best network), as seen with the initially tested flat distribution and with the real (and clearly non-flat) distribution.
Hence, unless the new gating approach drastically changes the distribution of strength of new networks vs the current network in an adverse way, the result should hold: a 50% gating threshold seems to be more efficient than a 55% gating threshold!
As discussed on Discord, I'll do some additional tests with other distributions of strength, to see if these results hold : -)
My guess: the only case where a 50% gating does NOT perform well is when many tested networks are just a bit weaker than the current one (ie 45% to 49%) and very few are better (above 50%). In that case, a significant proportion of the slightly weaker networks get promoted by chance, weakening the Elo, and this would not be compensated for by the rare stronger networks.
The good news is, I see no reason to get many nets "just a bit weaker" and very few "just a bit stronger", and the historical data does not show that either.
And by the way, Leela Chess is way more aggressive in its approach: a new net is promoted if the Elo does not decrease by more than 30 points (if I'm not wrong), which corresponds to a 45% gating threshold, and that can lead to a decrease in strength, as they saw recently. Here, the data seem to show that a 50% gating is optimal, and unless the distribution is extremely asymmetric around 50%, it should not lead to a decrease in strength.
So all in all, this seems to suggest trying to lower the gating from 55% to around 50% (and not as low as 45%, as done by Leela Chess, which gives not-so-good results… Only DeepMind's AZ managed to get good results with that; LC and Minigo did from time to time, but with regressions also from time to time. So 45% gating is a more extreme and risky path).

gjm11 commented Jun 7, 2018

I still think something has to be wrong with your reasoning, and here's why.

So far as I can tell, everything we know is consistent with the following story: network strength depends on (1) how many games the network has been trained on, plus (2) random noise. In this case, we get a sequence of candidate networks whose relative strengths are somewhat random in the short term, but whose long-term progress doesn't depend at all on the promotion criterion.

If that story is true (note: I think that story is probably not quite true; but, again, I don't know of any concrete evidence that refutes it), it cannot possibly be the case that any change in promotion criterion makes any difference to the long-term rate of improvement. Because that long-term rate depends only on how long the thing's been training for, and changing the promotion criterion doesn't change that.

So, if that story is true, there might be changes to the promotion criterion that make promotion rarer and make it produce proportionally bigger average gains per promotion; or that make it more frequent and make it produce proportionally smaller average gains per promotion; there might be changes that result in our noticing improvements faster and therefore (e.g.) always being a few ELO points ahead, on average, of where we would be with a "slower" criterion; but nothing we do will change the rate at which LZ improves.

I repeat that my guess is that this pessimistic story is not true; it seems entirely possible to me that more permissive promotion will increase diversity and make the most recent training games be played by a very slightly stronger network, and that both of these will speed up training. But we don't know either of those things, and you quite rightly aren't claiming to be able to model them. And in the absence of such effects, the long-term average effect on our rate of improvement of any change in when we promote has to be zero. And if your calculations say it isn't zero, then either (1) they're wrong somehow, or (2) they are making some sort of assumption about what affects the network's rate of learning, or (3) something is wrong with my reasoning above.

Friday9i commented Jun 7, 2018

@gjm11 You may be right…
But I'm in line with your last paragraph rather than with the previous ones ; -)
An example of why it could matter: 2 days ago, a 54.9% network was rejected. For 2 days we have trained with LZ147 when we could have trained with a (very probably) significantly better network. And a day before that, we did the same with 2 other 53%+ networks. Hence, we currently train with less diversity and a weaker network: isn't that a pity? Not training with better (and available) networks than the current one surely has some negative impact on training, slowing down improvement…

Marcin1960 commented Jun 7, 2018

What about a compromise? :) Like 52%?

I personally like 54% or 53%.

jkiliani commented Jun 8, 2018

Why not? I just think it's a pity not to test something quite promising just because it cannot be 100% proved it is better ;-(

A lot of people, me included, thought the same way about some ideas related to the training pipeline, but for obvious reasons, @gcp uses very high standards of evidence for any changes to the pipeline. A circumstantial analysis that leaves out some key aspects is not very likely to be actually tried. If we had a parallel, experimental pipeline, more such ideas might be tried at some point, but for now that's not the case.

john45678 commented Jun 8, 2018

" If we had a parallel, experimental pipeline". Maybe something like LeelaChess is doing now?

Friday9i commented Jun 8, 2018

@PhilipFRipper : you tell me to go f... myself, what to say? A bit inelegant. Ignore.
@jkiliani : I know the high standards, I understand them; that's why I try hard to get more specific elements. In the current case, unfortunately, there is no way to assess the impact more precisely: we have hints, but the only true possibility is to try (or not ; -).
@herazul : regarding the risks, a 50% gating cannot lead to the "Elo falling hard", that's just impossible. A slight fall, yes; a hard fall, no! Overfitting and overtraining: what is the link? I just propose to adjust the selection process, how could it lead to that…? Will it lead to plague and cholera too ; -)? Come on…
Anyway, I gave my arguments, nothing else to add

tapsika commented Jun 8, 2018

If we had a parallel, experimental pipeline, more such ideas might be tried at some point

This is something I actually find surprising not to exist yet. Since the project will likely run for a few years at least, having a secondary insurance plan seems like a good idea. Whether the computing power should be divided 90%-10% or 80%-20% between the main branch and the experimental one is debatable of course, but the current 100%-0% setting seems risky.

About 50% gating: as long as there are regular promotions, you don't lose much in reality even with 55%. Search+net is much stronger than the net itself, so the training should make decent progress even if the selfplay net is not the latest (search results remain a good target).

lightvector commented Jun 9, 2018

I did some independent back-of-the-envelope numerical experimentation and it suggests 55% gating is quite conservative, but also isn't vastly slower than a more permissive gating in adverse conditions.

Simplifying assumptions:

  • The true probability that the next net wins a game against the current best net is normally distributed with mean mu and standard deviation sigma.
  • You always run a full 500 games (no early-stopping) and accept a net if and only if the percentage of games won is >= the gating percentage.
  • Given mu and sigma, your goal is to choose a gating percentage to maximize the expected number of true Elo points gained per net tested. (It's 0 if you don't accept the new net).

With these assumptions, given mu and sigma, you can actually just write the formula for the probability of accepting a new net conditional on its true probability being a certain value, and numerically integrate to find the expected true Elo gain (or loss) per net. Under these idealized assumptions, I did some playing with the numbers. It's definitely possible to find values of mu and sigma for which 55% is the correct strategy; however, they need to be quite adverse and don't easily resemble the empirical data posted by @stephenmartindale.

For example:

  • If mu = 0.4 and sigma = 0.05, this is a very adverse case where generating a truly better net is a 2-sigma tail event that happens less than 3% of the time, and yet generating nets close to but under 0.50 true win rate is decently common. In this case you do find that a 50% gating would make very poor progress, as expected, due to accepting too many false positives (and of course having no gating at all would be a disaster). Surprisingly though, under this highly pessimistic case, approximately a 52% gating is optimal!
  • If mu = 0.2 and sigma = 0.1, approximately a 51.5% gating is optimal. Surprisingly, in this case a 50% gating actually makes almost as much progress, being only about 15% worse in expected Elo gain.
  • If mu = 0.46 and sigma = 0.02 this gives a very sharp peak where almost all the mass is just below but still quite close to 0.5. In that case, 55% gating is actually roughly optimal. The cases where 55% gating or more extreme are optimal appear to be these kinds of cases, where mu is actually very close to 0.5 rather than lower, but sigma is very very small, so that you have an extremely high density of slightly truly worse nets, coupled with an extremely low density of truly better nets.

I went into this thinking that even under these simplified assumptions it would be easy to construct cases where high gating thresholds were ideal, but was surprised at quite how adverse I needed to make the distribution to do so. Of course, 55% gating is very safe. In the majority of semi-realistic-but-still-fairly-adverse distributions (i.e. not too far off stephenmartindale's data), it costs you perhaps up to 30% of the optimal expected rate of true Elo gain (the optimum is often around 51% or 52%), but it will never have a high enough false-positive acceptance rate to be even the slightest bit worrying. If you're additionally worried about the above assumptions being wrong in ways major enough to eliminate the large safety margin that these calculations suggest you should have, or you think of Leela Zero foremost as a controlled run of a scientific experiment, it's not unreasonable to just continue as-is.

One last thing to note: while the difference between 50% gating and 55% gating is noticeable, the difference between no gating and 50% gating is massive. I see many posts in this thread where people have been worried that decreasing the percentage from 55% is a riskily large step toward no gating, but while it's certainly a risk, it's definitely the smaller step. Certainly in the above examples I tested, no gating would be an utter disaster with a negative expected Elo gain, whereas 50% gating usually still makes progress even with fairly adverse distributions, despite accepting more false positives than optimal.
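A minimal sketch of that calculation, under exactly the simplifying assumptions listed above (mu, sigma and the integration grid are just example values):

```python
# Numerical version of the back-of-the-envelope model above: true win rates
# ~ Normal(mu, sigma), a full 500-game match, accept iff win fraction >= gate.
import numpy as np
from scipy.stats import binom, norm

def elo_from_p(p):
    return -400.0 * np.log10(1.0 / p - 1.0)

def expected_gain(gate, mu, sigma, n_games=500):
    """Expected true Elo gain per tested net; a rejected net contributes 0."""
    p = np.linspace(0.01, 0.99, 981)               # grid of true win rates
    density = norm.pdf(p, loc=mu, scale=sigma)     # density of candidates
    k_needed = int(np.ceil(gate * n_games))        # wins needed to clear the gate
    p_accept = binom.sf(k_needed - 1, n_games, p)  # P(accept | true win rate p)
    dp = p[1] - p[0]
    return float(np.sum(density * p_accept * elo_from_p(p)) * dp)

# Example: the adverse mu = 0.4, sigma = 0.05 case discussed above.
for gate in (0.50, 0.52, 0.55):
    print(f"gate {gate:.1%}: {expected_gain(gate, 0.40, 0.05):+.3f} Elo per tested net")
```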

Friday9i commented Jun 11, 2018

Basically your experiment gives comparable results:

  • A gating threshold between ~50% and ~52% seems more efficient than 55% gating, unless the distribution is very adverse, which is not seen in current match results (and even if it were the case, it would not lead to a significant decrease, just a slower increase. Hence, the risks of testing are very limited!).
  • On the other hand, no gating is much more risky (as seen with LC0…)

But the official position from @gcp apparently remains: "The SPRT and gating we have is very strongly oriented towards avoiding false positives while ignoring false negatives".
So be it

Ttl commented Jun 12, 2018

Minigo, LC0 and AZ all made progress with no gating. If the model says it doesn't work then I would say that the model is wrong. It's hard to trust the conclusions if the predictions don't match the observations.

I don't think it's correct to model the win rate as a constant normal distribution. I would assume that the amount of new data the network has seen has a big effect on the win rate, but the constant model completely ignores it. Modeling the win rate as linearly dependent on the amount of new data, with normal noise on top of that, seems like a better model. Not sure if that is good enough either.

The training data window also does some filtering. With no gating, if the new network is weaker than the current best, it probably doesn't matter that much with a big window size, since the amount of new data from the new network is small compared to all the data. If the new network is better than the oldest network in the window, it is still an improvement in the average quality of the training data, and the strength keeps improving.
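A rough sketch of what such a "win rate grows with new data" model could look like; every constant below (games per cycle, drift, noise) is purely an illustrative assumption:

```python
# Hypothetical model: each candidate's true win rate vs the current champion
# rises with the data accumulated since the last promotion, plus normal noise.
# All constants are illustrative assumptions, not measured values.
import numpy as np

rng = np.random.default_rng(42)
games_per_cycle = 25_000            # assumed self-play games between candidates
drift = 0.03 / games_per_cycle      # assumed: +3% true win rate per cycle of new data
noise_sd = 0.02                     # assumed noise around that trend
gate, match_games = 0.55, 400

data_since_promotion = 0
total_elo = 0.0
for cycle in range(200):
    data_since_promotion += games_per_cycle
    true_p = min(0.99, 0.50 + drift * data_since_promotion + rng.normal(0.0, noise_sd))
    observed = rng.binomial(match_games, max(true_p, 0.01)) / match_games
    if observed >= gate:
        total_elo += -400.0 * np.log10(1.0 / true_p - 1.0)  # gain vs old champion
        data_since_promotion = 0                             # reference resets

print(f"total Elo over 200 cycles at a {gate:.0%} gate: {total_elo:.0f}")
```

Under a model like this, a rejected candidate carries its accumulated improvement into the next cycle, so the gate mostly changes when promotions happen rather than the long-run rate of improvement, which is roughly the point gjm11 made above.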

Friday9i commented Jun 12, 2018

@Ttl, regarding no gating "If the model says it doesn't work then I would say that the model is wrong": sorry but no, you completely misunderstood what we wrote...
The model says it's more risky without gating: with some distributions (of strength of new networks vs current network) it works, while with others it doesn't work well, leading to a decline of strength. This is exactly what LC0 and Minigo experienced.
The model also explains that (almost) whatever the distribution, 55% gating works well (we experience that also).
Hence, it's rather a confirmation that the model seems to work well (not the opposite).
Then, the model also adds 2 other things:

  • 55% gating works well but seems not to be optimal
  • a ~50% to ~52% gating should be more efficient; it is an "in-between" approach, with a small price to pay: a little bit more risk (of stagnating or a small decline, no real risk of a "downward spiral")

Clearer?

PS: the model "gating v2.zip" above allows you to test whatever distribution you want and whatever gating threshold you want (between 45%, which is comparable to no gating, and 55%, the current gating), and see for yourself how well it works : -)

herazul commented Jun 12, 2018

Minigo, LC0 and AZ all made progress with no gating

Minigo and LC0 did for sure, but they also ran into a lot of problems, and are still very, VERY far from LZ's progress. That's why there is no consensus on this subject yet.

a ~50% to ~52% gating should be more efficient

We have no proof of that.

a little bit more risk (of stagnating or small decline, no real risk of "downward spiral")

Maybe; it's difficult to tell what can happen if we come close to the optimum for the current net size and parameters, and there are a lot of 45-50% nets that can falsely promote and very few 50-55% nets.
Maybe it's worth trying, I don't know.
Or maybe it could be something like 52% or 53% with more test games to reduce the risk (like going to 600 or 800 games if a net is at 52-55% after 400, to be more sure that it's not a 49% net that got lucky).
Ultimately it's GCP's decision and I trust his judgment on this.

l1t1 commented Jun 12, 2018

In the three most recent rounds, it seems the new weights are weaker than before:

Start Date Network Hashes Wins / Losses Games SPRT
2018-06-12 21:27 24493eaa  VS  d0187996 63 : 94 (40.13%) 157 / 400 fail
2018-06-12 16:54 a07be14a  VS  d0187996 142 : 150 (48.63%) 292 / 400 fail
2018-06-12 12:22 b29de6f6  VS  d0187996 37 : 64 (36.63%) 101 / 400 fail
2018-06-12 07:50 6407b440  VS  d0187996 42 : 75 (35.90%) 117 / 400 fail
2018-06-12 02:06 7fe5fd03  VS  d0187996 60 : 87 (40.82%) 147 / 400 fail
2018-06-11 16:59 be56f8ee  VS  d0187996 169 : 171 (49.71%) 340 / 400 fail
2018-06-11 07:53 7ae0b679  VS  d0187996 173 : 171 (50.29%) 344 / 400 fail
2018-06-11 03:20 96d0a7da  VS  d0187996 110 : 129 (46.03%) 239 / 400 fail
2018-06-10 22:47 0c004c40  VS  d0187996 185 : 180 (50.68%) 365 / 400 fail
2018-06-10 18:15 32aa7140  VS  d0187996 47 : 79 (37.30%) 126 / 400 fail
2018-06-10 12:21 e760393b  VS  d0187996 141 : 148 (48.79%) 289 / 400 fail
2018-06-10 03:14 da78b580  VS  d0187996 171 : 163 (51.20%) 334 / 400 fail
2018-06-09 18:08 7f95fbe0  VS  d0187996 105 : 125 (45.65%) 230 / 400 fail
2018-06-09 13:35 9922e314  VS  d0187996 120 : 136 (46.88%) 256 / 400 fail
2018-06-09 09:03 7aa6918b  VS  d0187996 155 : 159 (49.36%) 314 / 400 fail
2018-06-09 04:31 ceba7884  VS  d0187996 139 : 148 (48.43%) 287 / 400 fail
2018-06-08 22:42 0acbb51a  VS  d0187996 176 : 168 (51.16%) 344 / 400 fail
2018-06-08 13:36 d13dddda  VS  d0187996 157 : 162 (49.22%) 319 / 400 fail
2018-06-08 04:31 d0187996  VS  10bc1042 233 : 171 (57.67%) 404 / 400 PASS
Friday9i commented Jun 12, 2018

A much-updated Excel file! You can:

  • enter the distribution you want (either the "real one" from @stephenmartindale, thx again, or one specified by yourself, ie the one you want to test!!!)
  • enter the "gating threshold" you want to test (eg 55% for current gating, or whatever % you want)

And you can see the average Elo gain under these circumstances. Once you fix the distribution, you can see the risks of a low threshold (it can generate a negative Elo gain…) and the sub-optimal results of a too-high threshold (little Elo progress, as many good-but-not-good-enough networks are rejected).
All in all, run some tests yourself and you'll see that:

  • a gating threshold around 50%-52% seems optimal in most cases, with (very) little risk of a decrease in strength
  • a gating threshold of 55% always works, but is somewhat inefficient ; -(, because a few too many good networks are rejected
  • a gating threshold below ~48%, or "no gating" (roughly equivalent to a 45% gating), is risky and probably less efficient: it works if the distribution of trained networks is favourable (as for AZ; less clear and more variable for Minigo and LC0…) and it can lead to a decrease in strength in unfavourable circumstances (eg if many trained nets are slightly worse and few are better)

@herazul, @Ttl and others, please test it yourself and tell me what you think of the results. It points in exactly the same direction as @lightvector's results : -)

Here are the results for one "unfavourable" distribution, with many not-so-good new networks:
The distribution:
[image: distribution of new-net win rates]

Results:
[image: average Elo gain by gating threshold]
Comments:
Despite the unfavourable distribution, 51% is the optimal threshold! A 55% threshold is 30% less efficient, and no gating (ie a 45% threshold) is very bad, with negative performance.

The file (with many comments explaining how it works, please read them!):
Gating v3.zip

odeint commented Jun 12, 2018

There are still these two effects which are IMHO significant:

@Ttl :

I don't think it's correct to model the win rate as a constant normal distribution. I would assume that the amount of new data the network has seen has a big effect on the win rate, but the constant model completely ignores it. Modeling the win rate as linearly dependent on the amount of new data, with normal noise on top of that, seems like a better model. Not sure if that is good enough either.

In other words, even if a network "unjustly" and narrowly fails to promote, usually not much is lost. The next candidates will become stronger and stronger, being more and more new training games ahead of the current champion. A stationary win-rate distribution of any kind does not capture this.

@jkiliani

What seems like a safe bet to me though is that this change would undermine the effectiveness of the training resetting to prevent overtraining. Currently this works well, since the reset to best network after each training run effectively limits the amount of training to only as much as produces a statistically significant increase in strength.

In other words, if a whole cycle does not promote at all, we go back to the old champion network and train it on the new window of games that have been accumulated. Every marginal promotion fixes all the learning done up to that point. I used to think that this wouldn't matter much as more new training could overcome some "bad learning". Now I'm a bit more wary, since the Leela Chess people were not able to learn their way out of a dead end, even after many millions of games and several training window lengths. The protection from overtraining that rare promotions and network resetting grant might be more valuable than we think. The chess people now had to do a new bootstrap and lost hundreds of Elo and millions of games in the process...

Friday9i commented Jun 18, 2018

Following @herazul's comment (from #1553), you'll see below that I fully took into account the real vs measured distribution of network strength!

Here is a synthesis of the global approach:

  1. I started from a distribution of strength of new nets vs the current net.
    We don't know that distribution: we just have the history of measured strength (thanks to @stephenmartindale), not the history of real strength… If a net has a true winrate of 52% for example (after millions of games, which we will of course not play), it will in the end (after 400 games) probably appear as a net with a winrate anywhere between 47% and 57% (and sometimes outside this range!). That's what @stephenmartindale's distribution measures.
    Because we don't have better info, and because the random noise is 0 on average, it's reasonable to hypothesize that the measured distribution is "quite comparable" to the true distribution; hence, it's a reasonably good proxy for the true distribution of strength (of new nets vs the current net) seen over the previous months.
  2. From there, we can build a model of gating-process efficiency (which I did in the Excel), starting from any distribution and more specifically from the true distribution (ie the proxy from point 1), modelling what we would observe (the measured distribution, polluted by random noise after 400-game matches) and the results we would get with various thresholds. And the results show that a 50% or 52% gating is more efficient than a 55% gating. So I fully take into account the true vs measured strength of new nets: no confusion on that fundamental point, and you can verify it in the Excel ;-).
  3. Limits of the approach: there is a bias in the analysis!
    Unfortunately, there is a fundamental bias, which I referred to from the beginning (ie the note at the end of the first post): changing the gating process certainly has an impact on the distribution of strength of new nets vs the current net, and we have no way to quantify that impact…
    The only way to get a definitive answer would be to re-test the whole chain: re-do the LZ process from scratch, with a different gating approach but the same training approach, and that is simply unrealistic!
    So, should we stop there? Mathematically speaking, yes! But machine learning is not an "exact science". Hence, could we still get a reasonable hint (leading us to think that a 50% gating could be more efficient than a 55% gating)?
    I think the answer is yes, and here is why:
  4. The efficiency of a 50% (or around 50%) gating seems to be robust; it does not depend much on the distribution of strengths ;-)!
    I tested it for a flat distribution of strength, and it works: a 50% gating is significantly more efficient (this is quite obvious, and btw it is even mathematically provable).
    I tested it with @stephenmartindale's measured distribution, and it works well too, confirming a 50% gating is more or less optimal (anywhere between 48% and 52% seems good, and significantly better than 55%).
    And I tested it with several other distributions (Gaussians with various means and variances, sums of 2 different Gaussians, ad-hoc distributions, etc.).
    All in all, in almost all cases, a 50% to 52% threshold seems more efficient than 55%, and almost never "risky" (ie leading to decreasing strength), so it cannot realistically lead to a significant loss of strength. And btw, the analysis confirms that AZ's no-gating approach IS much more risky, leading to a decrease in strength for several reasonable distributions of strength: AZ managed to avoid that pitfall, while Minigo and LC0 had problems from time to time (even though they managed to make progress overall).
  5. Partial conclusion: a lower gating, around 50%, is probably more efficient, with little risk.
    Are there other effects to take into account? Yes…
    It will generate new networks much more frequently (from history, 40% to 50% of tested networks would promote with a 50% gating, vs around 20% of the networks with 55% gating), and that will impact selfplay games: they will include more diversity (eg more josekis would be tested along the way, as each net has its own preferences. Hopefully the net may become "more human" thanks to that, playing many more different josekis, which would be nice!).
    These additional nets will also explore a wider part of the net-space: the journey will be a bit noisier, but that should help the search for a better maximum, probably helping training.
    I do not see adverse effects on training, but comments from specialists are welcome!
    A last point on the risk of a decrease in strength: there is a positive feedback that should limit the (small) risk of going down. Even if we are unlucky with 2 or 3 promotions of weaker nets, selfplay games from the previous nets (with better moves) should help training. Hence, the more we go down, the more it helps us go back up! So the risks seem very low altogether with ~50% gating.
  6. Conclusion
    This is not maths! It cannot be established like a mathematical theorem!
    But this analysis seems reasonably complete, and it gives a strong hint that a lower gating would probably be a bit more efficient. How much? Hard to say; I'd guess anywhere between a few % and 20% more efficiency.
    The risks seem limited, and we can adjust them by selecting the gating: 48% gating is efficient but a bit risky, 50% gating seems as efficient and less risky, 52% seems almost as efficient and much less risky (and the current 55% is very low risk, but apparently significantly less efficient).
    All in all, I'd go for either 50% or 52%.
herazul commented Jun 18, 2018

I don't understand this part:
Because we don't have better info, and because the random noise is 0 on average, it's reasonable to hypothesize that the measured distribution is "quite comparable" to the true distribution; hence, it's a reasonably good proxy for the true distribution of strength (of new nets vs the current net) seen over the previous months.

How do you account for the variance? Maybe the distribution is right, but how do you account for the fact that with a 52% gating you will have some networks with a 48% absolute winrate that pass, and the next one at 52% that fails? Because with 400 match games, that will happen A LOT.
I don't understand how you mathematically account for that in your Excel?

Friday9i commented Jun 18, 2018

@herazul : let's suppose the true strength (after an infinite number of match games!) of new nets follows an arbitrary distribution, eg a Gaussian with a mean of 0.4 and a standard deviation of 0.1. What will the measured strength be after 400-game matches? It will be a Gaussian with a mean of 0.4 and a standard deviation a bit larger, probably around 0.12 or 0.15 (I could compute it, but it's painful, sorry! And btw, the Excel above gives you the opportunity to test it yourself: enter that distribution, look at the measured one, and recover its mean and standard deviation).
More generally, whatever the true distribution, the measured one will look very similar, just a bit more smoothed and spread out (by the noise). But altogether, it will be "quite similar" to the real distribution. Hence, the reverse operation is easy: the true distribution should be quite similar to the measured one ;-). In reality, it should be a bit spikier, as the measured one is smoothed, but that's probably a second-order effect, and there are other significant uncertainties in other parts of the experiment, so this approximation seems reasonable.
Moreover, the whole process described above shows the result holds for many distributions that are very different from each other, so a small difference between the measured distribution and the real one should not have a big impact; the results seem robust to this (unfortunate) effect.
And for the Excel, it's easy: I assume the true distribution has a given form, I randomly choose 10000 nets with probabilities drawn from that distribution, then for each of these networks I randomly draw 400 numbers to simulate game results, and I apply the (simplified) SPRT test: hence, I know which nets are selected or not, and their real strength. Some will be weak but selected because they are lucky, others will be strong and not selected; that's life. What counts is the average result, and that's what I compute over the 10K nets, depending on the threshold chosen (50%, 52%, 55%, other values). Then I play with other distributions and see which threshold is the most efficient. The answer is, most of the time, "around 50% or 52%".
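A tiny numerical illustration of the "measured vs true" point, assuming for example a Normal(0.40, 0.05) true distribution and 400-game matches:

```python
# True vs measured win-rate distribution: the 400-game match adds binomial
# noise with standard deviation about sqrt(p*(1-p)/400) ~ 0.025 on top of
# the assumed Normal(0.40, 0.05) true distribution.
import numpy as np

rng = np.random.default_rng(7)
n_nets, n_games = 10_000, 400
true_p = np.clip(rng.normal(0.40, 0.05, n_nets), 0.01, 0.99)
measured_p = rng.binomial(n_games, true_p) / n_games

print(f"true:     mean {true_p.mean():.3f}, std {true_p.std():.3f}")
print(f"measured: mean {measured_p.mean():.3f}, std {measured_p.std():.3f}")
```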

herazul commented Jun 18, 2018

@Friday9i What I don't understand is:
Let's say you have a distribution of nets (tested with an infinite number of match games) with a mean of 50% WR and a 10% standard deviation. Then you take a 50% gating and the training will obviously be faster with no downside, because you will have no worse net in the training process.
Now do the same thing with 400 match games: you will have almost the same distribution, but then if you apply a 50% gating, you will have a lot of <50% real-winrate nets in the training, and a lot of >50% real-winrate nets out of the training, which will maybe mess up the training really badly.
So with the same distribution and %WR gating, I don't understand how you account for the false positives in the 400-game match setup. And I don't know how that's accounted for in the Excel.
(And I don't know how your answer addresses this concern.)

PhilipFRipper commented Jun 18, 2018

To be fair, someone specialized in mathematics should look at it instead of me, or most of us.

People around here are quite empirical. So the best bet is to recruit someone who is willing to test the idea on a small net for you, Friday9i. I would offer myself, but I'm even less useful than everyone else. I read every single post, but I'm concerned with Go, and not particularly good with computers, programming, or mathematics.

In lieu of an empirical experiment, perhaps a critique from another mathematically-inclined person would serve. Could you ask anyone? A little bit of on-site peer review to make up for our lack of sophistication.

My vote doesn't count, because I have nothing to back it up. So I try not to speak much. Sorry for interrupting.

Friday9i commented Jun 18, 2018

@herazul, I tested with the Excel:
Input distribution (I took only a 0.05 standard deviation; otherwise with 0.1 you get 15% of nets above a 60% winrate, which is very unrealistic :-), here it is:
[image: input distribution histogram]
Note: I split it into 2.5% segments, so the peak appears to be between 50% and 52.5%, but mathematically it works as intended (half of the nets below 50%, half above).

And the result is seen here, ie the average Elo gain depending on the chosen threshold; the maximum is around 50% (once again :-):
[image: average Elo gain vs gating threshold]
Btw, in this (not so realistic) case, no gating would work perfectly well. Not optimal, but working well!

lightvector commented Jun 18, 2018

@herazul - With 50% gating and 400 test games you will definitely accept false positives. The claim is exactly that with 50% or 52% gating, you will have few enough false positives that you will gain strength faster, because you also lose fewer genuinely good nets and this more than compensates. If you do the math, this is true under an idealized model like the one Friday9i tested (and his math does account for the false positives); the main issue other people have raised is that there is no empirical data to verify the theoretical argument.

PhilipFRipper commented Jun 18, 2018

Ooh, just a quick question: why does the effectiveness with 60% gating end up lower than with no gating in your example? That's non-intuitive to me. You don't have to answer! Just curious how that works out.

Friday9i commented Jun 18, 2018

Just corrected a mistake I made; the graph is updated above ; -)
And here is the latest version of the Excel (correcting the error):

Gating v4.xlsx.gz

lightvector commented Jun 18, 2018

@PhilipFRipper - Under @Friday9i's test scenario 60% gating rejects 90%+ of good nets too by being too strict, slowing improvement down so much that it becomes slower than no-gating in that scenario.


Since the discussion seems to be running in circles a bit now - I think nobody is contesting that within the assumptions of such a model it's robustly the case that 52% is better than 55%, everyone who has modeled it has found a similar result, and nobody really disagrees.

But the standard for making a change is to produce empirical test data (not an unreasonable standard to have in general), so for better or worse the ball is firmly in the court of the people who favor reducing the gating to rally the computing power to fork the training pipeline and/or actually produce a test of the idea. Probably more Excel simulations aren't going to convince anyone not already convinced.

Friday9i commented Jun 18, 2018

Indeed, @lightvector : -)

Marcin1960 commented Jun 18, 2018

Perhaps someone with a strong PC can test the idea on a small, fast net: one with 55% gating, one with 51%, and one with no gating?

herazul commented Jun 19, 2018

@Friday9i OK, got it.
So yeah, if you do account for false positives in the form of "sometimes lost Elo", because of variance reaching into the negative-WR part of the distribution, then we are back to the original bias: false positives impact training, but that's not accounted for, because we can't predict how it will impact training.
That was useless questioning on my part, sorry!

What I was also thinking is that, without impacting the training, we can't increase the number of match games, and so we can't decrease the gating without increasing the risk of false positives.
Taking into account that if we increase the number of match games we also decrease the number of training games (because we have a limited pool of computing power), there must be an ideal balance between the %WR gate, the number of match games and the number of training games.
And I think it would be expected that if progress slows down for a given network size, it would surely become more and more interesting to gradually increase the number of match games and reduce the gating accordingly.
I'm not sure that balance could be simulated mathematically, because changing the training-games/match-games proportion impacts training.

Friday9i commented Jun 19, 2018

@herazul No problem!
In line with your second point: it's not possible to build a model for this, because it has an unknown impact on the distribution of new nets. But qualitatively, we see that when we come closer to the peak of a net size (eg currently with the 15x192 nets), there are fewer and fewer nets above a 50% winrate, while on the contrary, just after switching to a larger net, many new nets are stronger than the current best one, with frequent promotions. That would encourage us to take more risks when switching to larger nets (by decreasing the number of match games) while being more prudent when things have stagnated for some time (by increasing the number of match games). But that is a rather "qualitative" appreciation of the situation, linked to the uncertain distribution of new nets' strength, and I have no idea how many games should be played depending on the context…
