
Version 0.10 released - Next steps #591

Open
gcp opened this issue Jan 7, 2018 · 730 comments

Comments

@gcp
Member

gcp commented Jan 7, 2018

Version 0.10 is released now. If no major bugs surface in the next few days the server will start enforcing this version.

There is a 1500+ post issue where most plans for the future were posted previously. It's become rather hard to read, especially on mobile, and it's mixed with a lot of theories (most not backed by any data or experiments 😉), so I'll post my plans and thoughts for the near future in this issue.

It looks like we're slowly reaching the maximum of what 64x5 is capable of. I will let this run until about 2/3 of the training window is from the same network without improvement, and then drop the learning rate. I expect that's the last time we can do that and (maybe!) see some improvement.

I have been training a 128x6 network starting from the bug-fixed data (i.e. starting around eebb910d) and gradually moving it up to present day. Once 64x5 has completely stalled, I will see if I can get the 128x6 to beat it. If that works out, we can just continue from there and effectively skip the first 6500 Elo and see how much higher we can get (and perhaps do the same with even bigger networks) from continuing the current run.

If that kind of bootstrapping turns out not to work, I'd be interested in doing a new run. My ideas for that right now:

  • Somewhere between 128x6 and 128x10 sized network. 128x10 would be 8 times slower, but there is a ~2x speed improvement that we could expect to have merged in by then, and the total running time would be around half a year maybe? "Short" enough that people are probably mostly going to stick around. Hopefully also "big" enough that we can see pro level play.

  • Immediately use new networks for self play (i.e. according to the latest AZ paper). We see very strong strength see-sawing right now. It is possible that using the new network immediately lets the learning figure out why some of those networks are bad and thus produce faster improvement. It's also possible that this procedure produces no or very slow improvement for our unsynchronized distributed setup and this run ends up being a total failure. But I think we should try to find this out, in the interest of answering the question in case anyone ever tries a "full" 256x20 run on BOINC or an improved version of this project.

  • Small revision of the weights format. There is a redundant bias layer (convolution output before BN) that needs to go, and I want to add a shift layer after the BatchNorm layers. The latter hasn't been generally shown to provide improvements (and I never found any in Go either), but it is computationally almost free and makes the design more generic, so we might as well include it. (Note that scale layers are completely redundant in the AGZ architecture, so there is no point in adding those.) A minimal sketch of the resulting block layout follows this list.

  • There's been a demonstration that instead of stopping at 1600 playouts per move, it may be more computationally efficient to stop at 2200 "visits" per move. So we should do that.
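
To make the weights-format bullet concrete, here is a minimal sketch of the intended block layout (PyTorch is used purely for illustration; the real code is C++/TensorFlow and the names below are made up):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Sketch of the revised block layout (illustrative names only).

    - conv has bias=False: a bias before BatchNorm is redundant, since the
      BN mean subtraction cancels any constant offset.
    - BN keeps its shift (beta) but not its scale (gamma): with a ReLU
      following, a positive per-channel scale can be folded into the next
      convolution's weights, so it adds nothing.
    """
    def __init__(self, channels=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels, affine=True)
        nn.init.ones_(self.bn.weight)          # fix gamma at 1...
        self.bn.weight.requires_grad_(False)   # ...so only the shift is learned

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))

print(ConvBlock()(torch.randn(1, 128, 19, 19)).shape)  # torch.Size([1, 128, 19, 19])
```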

Thanks to all who have contributed computation power and code contributions to the project so far. We've validated that the AlphaGo approach is reproducible in a distributed setting - even if only on a smaller scale - and made a dan player appear out of thin air.

Some personal words:

I have been very, very happy with the quality and extent of code contributions so far. It seems that many of you have found the codebase approachable enough to make major enhancements, or use it as a base for further learning or other experiments about Go or machine learning. I could not have hoped for a more positive outcome in that regard. My initial estimate was that 10-50 people would run the client, maybe one person would submit build fixes, and that would be it. Clearly, I was off by an order of magnitude, and I'm spending much more time than foreseen on doing things like reviewing pull requests etc. So please have some patience in that regard - I will keep trying to do those thoroughly.

For the people who have a lot of ideas and like to argue: convincing, actionable data (or even better, code that can be tested for effectiveness) will make my opinion flip-flop like the best/worst politician, whereas arguing with words only is likely to be as fun and effective as slamming your head against a wall repeatedly.

Miscellaneous:

I am very interested in any ideas or contributions that make me more redundant for this project. I have some ideas of my own that I want to test. My wife would also like to see me again!

The training and server portions run fully automatically now for the most part (cough @roy7), although some other things like uploading training data have proven problematic to automate, so that won't be live for the foreseeable future either.

There's been a lot of concern about bad actors, vandalism, broken clients, etc, but so far the learning seems to be simply robust against this. There is now some ability to start filtering bad training data, but it remains tricky to make this solid without giving too many false positives. I'd advise only worrying when there are actual problems.

@RavnaBergsndot

RavnaBergsndot commented Jan 7, 2018

How does the 128x6 network training work? If we keep the same training pipeline as 64x5, it needs self-play test games. Would the self-plays be done by some contributing clients, or would you do all of them yourself?

@ssj-gz

ssj-gz commented Jan 7, 2018

"Once 64x5 has completely stalled, I will see if I can get the 128x6 to beat it."

Does it beat it now? :)

@marcocalignano
Member

@gcp Thanks!

@gcp
Member Author

gcp commented Jan 7, 2018

Would the self-plays be done by some contributing clients, or would you do all of them yourself?

I can upload the networks and schedule tests for them, same as it happens for the regular networks. The clients won't really notice, they'll just run a bit slower :-)

@john45678

@gcp thanks.

@jkiliani

jkiliani commented Jan 7, 2018

Thank you for running this project, it's been a delight to follow and contribute wherever possible!

About the increase in network size: Is there any good way to test in which cases increasing number of filters helps more, and where the Deepmind approach "Stack more layers" is better? In other words, is there a significant possibility that something like 64x10 might reach similar strength as 128x6? Is there any way other than training supervised nets to find out?

@fishcu

fishcu commented Jan 7, 2018

GCP and everyone involved, thank you very much for all your efforts! This project has been truly fascinating to follow, both as a Go player and as a developer. Looking forward to further experiments.

@Matuiss2

Matuiss2 commented Jan 7, 2018

Thanks for the computer-only version, I'm generating games in 17 minutes instead of 5 hours XD, what a massive improvement!

@grolich

grolich commented Jan 7, 2018

Thank you for running and managing this wonderful project :)

@evanroberts85

evanroberts85 commented Jan 7, 2018

Plans sound good. Just to be clear, with the 128x6 network are you moving up by the intervals of the 64x5 “best” networks, using the 250k game window that those networks were trained upon? Also, are you using a set number of steps? I guess this should work but other approaches are likely to work better.

Also, this will not really tell us if the difference in strength is due to the size of the networks or the differences in how they are trained.

@RavnaBergsndot

Is there any good way to test in which cases increasing number of filters helps more, and where the Deepmind approach "Stack more layers" is better?

Isn't Deepmind's approach stacking both more filters and more layers? AGZ has 256 filters.

@gcp
Member Author

gcp commented Jan 7, 2018

About the increase in network size: Is there any good way to test in which cases increasing number of filters helps more, and where the Deepmind approach "Stack more layers" is better? In other words, is there a significant possibility that something like 64x10 might reach similar strength as 128x6? Is there any way other than training supervised nets to find out?

I believe that in general stacking deeper is more attractive for the same (theoretical!) computational effort. You leave more opportunity to develop "higher level" features (or not, when not needed, especially in a resnet where inputs are forwarded!), or the possibility for features to spread out their influence spatially. Deeper stacks are harder to train, but ResNets and BN appear to be pretty good at dealing with that.

But in terms of computational efficiency, a larger number of filters tends to behave better, especially on big GPUs, because that part of the computation goes in parallel. The layers need to be processed serially.

"In theory" 128 filters are 4 times slower than 64 filters, but in practice, the difference is going to be much smaller.

@gcp
Member Author

gcp commented Jan 7, 2018

Isn't Deepmind's approach stacking both more filters and more layers? AGZ has 256 filters.

They did 256x20 and 256x40. They did not do 384x20, for example.

@gcp
Member Author

gcp commented Jan 7, 2018

Just to be clear, with the 128x6 network are you moving up by the intervals of the 64x5 “best” networks, using the 250k game window that those networks were trained upon? I guess this should work but other approaches are likely to work better.

No, I started with a huge window and have been narrowing it to 250k.

@barrtgt

barrtgt commented Jan 7, 2018

Thanks for the fantastic work! I'm interested in knowing how the results for the 2200 visits were obtained. Also, has anyone trained supervised networks with different depths and filter counts?

@gcp
Member Author

gcp commented Jan 7, 2018

I'm interested in knowing how the results for the 2200 visits were obtained

See the discussion in #546. There's still some work ongoing in this area, and further testing, but it looks promising. The idea is not to spend too much effort in lines that are very forced anyway.
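
Conceptually the change looks like this (a sketch only, not the engine code; `run_one_simulation` and `best_child` are hypothetical helpers): with a visit cap, positions whose subtree was largely reused from the previous move finish after far fewer new simulations.

```python
def search_with_visit_cap(root, visit_cap=2200):
    """Sketch: stop on total root visits rather than on new playouts.

    If the subtree was carried over from the previous move, root.visits
    starts above zero, so forced/obvious positions consume far fewer new
    simulations than a fixed 1600-playout budget would.
    """
    while root.visits < visit_cap:
        run_one_simulation(root)   # hypothetical: one MCTS select/expand/backup
    return best_child(root)        # hypothetical: pick the move by visit count
```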

@isty2e

isty2e commented Jan 7, 2018

First I would like to thank @gcp and all contributors for this awesome project and efforts. Now we can prepare for the next run, and here are things I would like to clarify or discuss:

  1. AGZ uses rectifier nonlinearity, while we are currently using ReLU if I am not terribly wrong. For the new network, it could be desirable to change the activation from ReLU to a nonlinear one, but unfortunately the AGZ paper lacks details about this. What would be our choice? There are many options like LeakyReLU, CReLU, or ELU.
  2. For the training window, I still do not see any advantage we gain from including data from too weak networks. In addition to the training window by number of games, how about filtering out games based on rating also (like 300 or any reasonable value)?
  3. I am still dubious whether the AZ approach is reproducible for us, when the computational resource is fluctuating. A milder approach would be an accept-if-not-too-bad one, prioritizing networks with more training steps. Is that reasonable enough?
  4. For networks with more filters, F(2x2, 3x3) Winograd convolution for CPU and GPU #523 must be merged somehow. What should be done to accomplish this as fast as possible?
  5. This question probably cannot be answered without an experiment, but I have always been thinking that an 8-move history for ko detection is too much, even though in practice more feature planes will somehow lead to a stronger AI. Can we consider reducing the input dimension from the current 8x2+2 to a smaller one, like 4x2+2?
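
Regarding the input planes in point 5, this is roughly how the current 8x2+2 = 18 planes are laid out, as I read the AGZ paper (a sketch, not the actual leela-zero code; reducing T to 4 gives the proposed 10 planes):

```python
import numpy as np

def encode_position(own_history, opp_history, black_to_move, T=8, board=19):
    """Sketch of the 8x2+2 input encoding as I understand it (check the real
    code before relying on this): T pairs of stone planes from the side to
    move's perspective, plus two colour-to-move planes, i.e. T*2 + 2 planes.

    own_history / opp_history: lists of 19x19 {0,1} arrays, most recent first,
    shorter than T at the start of the game (missing planes are all zero).
    """
    def plane(history, t):
        return history[t] if t < len(history) else np.zeros((board, board))

    planes = []
    for t in range(T):
        planes.append(plane(own_history, t))
        planes.append(plane(opp_history, t))
    planes.append(np.full((board, board), 1.0 if black_to_move else 0.0))
    planes.append(np.full((board, board), 0.0 if black_to_move else 1.0))
    return np.stack(planes)   # T=8 -> (18, 19, 19); T=4 would give (10, 19, 19)

print(encode_position([np.zeros((19, 19))], [np.zeros((19, 19))], True).shape)
```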

@Dorus

Dorus commented Jan 7, 2018

A milder approach would be an accept-if-not-too-bad one, prioritizing networks with more training steps. Is that reasonable enough?

But the goal of AZ is to eliminate evaluation matches. If you need to know "not-too-bad", you need evaluation matches and you might as well go full AGZ. (This is pretty much the opposite of the argument we used to reject switching to the AZ method this run.)

Also, shouldn't we try to reproduce AZ exactly because "we're not sure if it is reproducible"? If we change all kinds of things and then fail (or succeed), we still do not know if it is because we changed a bunch of stuff or because it's an inherently bad method.

Anyway, before we start with a new larger network: how viable would it be to do one or a few runs with a smaller network, but with some variables adapted? For example, we could use the current games and train a 3x32 network, and then run 500k games. After that, run a 3x32 network from scratch and run 1m games to see the result. (And would these results carry over to larger networks?)

Another experiment I'd like to see would be to try different window sizes. We could use the current 5x64 network for that. Just go back 1m games, and train the then-best network with a 100k or 500k window (or possibly 2 runs, one with each window size), and then run 500k games or so.

We're at 43k games/day now, so experiments like that would take ~2 weeks, but they might give valuable data for our next run, which might take several months. Using a 3x32 network could probably quadruple our game output and only take a couple of days to get meaningful results.

@RavnaBergsndot

RavnaBergsndot commented Jan 7, 2018

AGZ uses rectifier nonlinearity, while we are currently using ReLU if I am not terribly wrong. For the new network, it could be desirable to change the activation from ReLU to a nonlinear one, but unfortunately the AGZ paper lacks details about this. What would be our choice? There are many options like LeakyReLU, CReLU, or ELU.

BN dramatically reduces, if not totally eliminates, the assumed advantages of other fancy activations over ReLU. This is why all those huge CNNs (Res-101, 1201, etc.) prefer trying all kinds of different structures and filter/layer combinations rather than exploiting the seemingly low-hanging fruit of better activation functions. They are not low-hanging fruit because they only offer advantages in some non-general cases and controlled environments.

By the way ReLU is a nonlinear function, and two layers of ReLU could theoretically approximate all continuous functions, just like tanh and sigmoid.

This question probably cannot be answered without any experiment, but I always have been thinking that 8-moves history for the ko detection is too much, though more feature planes will lead to a stronger AI somehow practically. Can we consider reducing the input dimension from current 8x2+2 to a smaller one, like 4x2+2?

4x2+2 can't detect triple-ko.

@evanroberts85

evanroberts85 commented Jan 7, 2018 via email

@isty2e

isty2e commented Jan 7, 2018

@Dorus It is true that there is no evaluation in AZ, but I am not sure if that is the purpose. In fact, the motivation for the changes from AGZ to AZ is unclear in the paper.

@RavnaBergsndot That is theoretically true to an extent, but in practice it affects the performance more or less, usually depending on the nature of the dataset. A simple example would be this. After all, this project aims to be a faithful replication of AGZ, so why not? Also, AFAIK a triple ko consists of 6 moves, so 3x2+2 will do, and we are not adopting superko rules, so is it meaningful at all to detect a triple ko?

@barrtgt

barrtgt commented Jan 7, 2018

A consistent training procedure with no added variables would be nice for comparing different configurations. I think the 5x64 has quite a bit more potential, but was hamstrung by a rough start. I like the idea of the AZ method of using the latest network. I vote to do a small-scale AZ approach first.

@RavnaBergsndot

RavnaBergsndot commented Jan 7, 2018

That is theoretically true to an extent, but in practice it affects the performance more or less, usually depending on the nature of the dataset. A simple example would be this. After all, this project aims to be a faithful replication of AGZ, so why not? Also, AFAIK a triple ko consists of 6 moves, so 3x2+2 will do, and we are not adopting superko rules, so is it meaningful at all to detect a triple ko?

Most of the experiments in that paper were done without BN. BN forces most input points to fall into the most interesting part of the ReLU domain, therefore reducing the need for non-zero outputs when the input is negative. We need more recent experiments.

I'm also not convinced that AGZ's "rectifier nonlinearity" means "rectifier plus nonlinearity on its negative domain" instead of just ReLU itself.

3x2+2 won't do, because that "x2" part is for the same turn. "These planes are concatenated together to give input features st = [Xt, Yt, Xt−1, Yt−1, ..., Xt−7, Yt−7, C]." Therefore for 6 moves, we need at least 6x2+2.

@ddyer0

ddyer0 commented Jan 7, 2018

I feel the need for a more general discussion, more beginner-friendly and less specifically about Leela. Please have a look at https://www.game-ai-forum.org/viewforum.php?f=21

@isty2e

isty2e commented Jan 7, 2018

@RavnaBergsndot Well, but batch normalization was there in the CIFAR-100 benchmark. And if the shift variable is somehow set inappropriately during training, the batchnorm layer can shift the input into the negative region of ReLU ("dying ReLU"), so that is the idea behind all the modified rectifier units. I can hardly imagine any case referring to ReLU as the "rectifier nonlinearity", because, you know, ReLU is a rectified linear unit.

And you are right about the input features, though I still do not see why we need to detect triple ko in the first place.

@grolich

grolich commented Jan 8, 2018

though I still do not see why we need to detect triple ko in the first place.

Triple kos matter in rule systems without superkos.

Actually, they change the result of the game.
With superko, they would just make a move illegal.

Without it, they form a (really interesting, I might add) situation where, if neither player is willing to give way, the game cannot end and it is declared a draw (actually a "no result", but in situations where no return matches are played, it's effectively the same).

Not including enough information for triple-ko detection in the NN would make the network unable to tell the difference between a situation where a move would end the game without a win or a loss, and one where it would not.

So even if we aren't interested in superko, being able to detect triple ko is still the bare minimum.

That being said, it might help a lot with superko detection as well, since gapped repetitions are exceedingly rare in actual play, perhaps sufficiently so that the "damage" of not recognizing these cases without search might not be felt.

However, the important thing was to demonstrate why triple ko detection is needed even if we do not use superko.

@ayssia

ayssia commented Jan 8, 2018

Why 128 filters? 24 blocks * 64 filters should consume the same time as 6 * 128, and I wonder how blocks/filters affect strength...
Maybe we can train a 24 * 64 network and a 6 * 128 network to compare them?

@jkiliani

jkiliani commented Jan 8, 2018

64 filters, 24 blocks will almost certainly use more time than 128 filters, 6 blocks. @gcp explained earlier that increasing the number of filters allows more parallelization and is thus usually much less than quadratic in computation time on a GPU. Layers have to be evaluated serially on the other hand.

@gcp
Member Author

gcp commented Jan 8, 2018

I'm also not convinced that AGZ's "rectifier nonlinearity" means "rectifier plus nonlinearity on its negative domain" instead of just ReLU itself...I can hardly imagine any case referring to ReLU by the "rectifier nonlinearity", because you know, ReLU is a rectifier linear unit.

I'm 99.9% sure that "rectifier nonlinearity" exactly means ReLU. ReLU is a non-linear unit constructed from a rectifier and a linear unit. A rectified linear unit is a rectifier non-linearity.

As was already pointed out, the advantages of "more advanced" activation units disappear when there are BN layers involved, which is why everyone including DeepMind just uses BN+ReLU.

@gcp
Member Author

gcp commented Jan 8, 2018

  1. In addition to the training window by number of games, how about filtering out games based on rating also (like 300 or any reasonable value)?

It's important to make sure the window has enough data or you will get catastrophic over-fitting, especially for the value heads. You can test this yourself. This can't be guaranteed if you introduce a rating cutoff so it's a bad idea.
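
As a simplified illustration of why (not the actual server logic): the window is defined purely by game count, so the amount of training data is guaranteed, whereas a rating cutoff makes it unpredictable.

```python
def select_training_window(games, window_size=250_000, min_rating=None):
    """Simplified sketch of window selection (not the real server code).

    games: list of (rating_of_generating_net, game_record), newest last.
    With a purely size-based window the amount of data is fixed; adding a
    rating cutoff can silently shrink it below what the value head needs,
    which is where the catastrophic over-fitting risk comes from.
    """
    if min_rating is not None:
        games = [g for g in games if g[0] >= min_rating]
    return games[-window_size:]
```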

@Dorus

Dorus commented Feb 26, 2018

I think something happened just now when a91721af got promoted, as it's at 48% now. Complete blow out early on?

Yes, see the match history.
11-1 win (SPRT pass) before going on a losing streak to end at 44.90% :(

We've discussed this before; the scenario just never happened till now. a9 has a better winrate in the first 200 moves and thus has a bias on short matches. We start 40-50 matches right away, so the half of them that is relatively short arrives early and gives a biased win %. So far we've had a bunch of 10-1 nets that eventually failed; this is the first 11-1 net that did so. I guess it just got lucky.

The only way to prevent situations like this (and also 92dd0397, which promoted at 220-180 and then ended at 230 : 191 (54.63%)) would be to only accept a pass when all 400 games have been sent out and all ongoing games have returned. This would slow down promotions by 30 minutes to 2 hours, but it's not too hard to program on the server.

Another workaround would be a minimum number of games, say 40, before we accept an SPRT pass.
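
Roughly like this (a sketch; `sprt_status` stands in for the server's existing sequential test and is hypothetical here):

```python
def promotion_decision(wins, losses, min_games=40):
    """Gate an SPRT pass behind a minimum number of games.

    sprt_status() is a hypothetical placeholder for the existing sequential
    probability ratio test; it returns "pass", "fail" or "continue".
    """
    status = sprt_status(wins, losses)
    if status == "pass" and wins + losses < min_games:
        return "continue"   # an early 11-1 style pass waits for more results
    return status
```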

@jkiliani

@Dorus This did happen before, with af4f49f1. A minimum number of games in addition to an SPRT pass would be a good fix in my opinion.

@LRGB

LRGB commented Feb 26, 2018

Yes, this is the second time we've had an under-50% promotion, both times causing minor commotions :)
I like the minimum number of games idea.

By the way, has any thought been given to adopting an intermediate step between the promotion criteria we are using now and always promote? Say a 50% threshold (or lower)? While the always-promote method clearly works (as does the one we are using now), I wonder if there isn't a sweet spot that balances network diversity vs quality and gives optimal results? After doing this for a while we could move to always promote and compare?

@Marcin1960

Marcin1960 commented Feb 26, 2018

I suspect that always promote helps new, temporarily crippling discoveries to be retained by the next stronger networks, so it opens doors to creativity. Of course, on the condition that the window is large enough for the older, stronger nets to be present.

@dzhurak

dzhurak commented Feb 26, 2018

Do not forget that we use random noise in training games. It gives enough creativity. I don't expect the always-promote approach will be any faster than the current one. Also, we have the example of the MiniGo guys, who used the always-promote technique, and their results were not any better than ours.

P.S. The current methodology works great, so as far as I'm concerned there is no need to change it. But I am biased towards getting the strongest bot, not towards discovering the best algorithm.

@Marcin1960

There is another logical option - if the promoted net fails, promote the previous best net back (unless another, third net got promoted in the meantime, in which case I would pass).

@jjoshua2

jjoshua2 commented Feb 26, 2018 via email

@2ji3150

2ji3150 commented Feb 26, 2018

As you correctly anticipated and inferred, I cut learning rate to 0.00015 (which is "halfway" a 10x reduction) after we had 170k games without promotion. I also increased max steps to 256k and rejiggled the steps that will get matched a bit (but I made a mistake so the uploads this weekend stopped at 128k, now fixed).

Still don't see the 256k steps network.

@roy7
Collaborator

roy7 commented Feb 26, 2018

@gcp So that I clearly understand: you want a toggle on the submit-network endpoint you can set that says "promote this as the new best network immediately"? I'm not sure how we'd change the graphing setup to deal with a best network that has no match history. Maybe another parameter to manually set an Elo score that would override the current way we calculate them all?

Is this towards an eventual new "always promote" system, or something we'll use for 10x128? If it's for always promote, what are our thoughts on how we'll generate Elo figures in that case? Still schedule matches like usual?

In an always promote world, if we still want to run match games, we could set the endpoint to auto promote after upload and auto add a match between the new upload and the prior best network at the same time. (Edit: In this case, there would be no more promotion as the result of matches, simply results for ELO graphing.)

@Dorus

Dorus commented Feb 26, 2018

I would propose you just replace best-net every x hours, but for the elo graph you schedule a match every y hours, where y is a bit larger than x. Like you get a new net every 4 hours, but you schedule a match every 24 hours.

A future bonus would be to run a more tournament-like setting where you play match games against a bunch of previous nets. Instead of 400 games every 24 hours, you could play 10 games against each of the last 6 nets, resulting in 60 match games per new network, and also around 6*60=360 per day. Each net would then play 120 match games, 60 against the nets before it and 60 against the nets after it. However, we would need to borrow some code from somewhere to calculate Elo from that data :)

I don't know how hard BayesElo is to calculate, but I believe the CGOS code is available.
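
For a single pairing you don't even need BayesElo; the standard logistic Elo model already gives a usable number (a sketch, nothing project-specific):

```python
import math

def elo_gap_from_score(wins, losses):
    """Elo difference implied by a head-to-head score, from the standard
    logistic model: expected score = 1 / (1 + 10^(-diff/400))."""
    score = wins / (wins + losses)
    score = min(max(score, 1e-6), 1 - 1e-6)   # avoid log of 0 for shutouts
    return -400 * math.log10(1 / score - 1)

print(round(elo_gap_from_score(230, 191)))   # 92dd0397's 54.63% is roughly +32 Elo
```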

wctgit referenced this issue in leela-zero/leela-zero-server Feb 27, 2018
@jjoshua2

@pangafu What weights are w24, and what GPU are you using? I'm wondering how many playouts your unlimited bot on CGOS gets. It's doing remarkably well, almost as good as the 20 block net at 1600 playouts.

@ghost

ghost commented Feb 28, 2018

@jjoshua2 Might be better to compare the 1600 playout versions since more playouts make it a lot stronger.

@jjoshua2

They are interesting too. The HW24 1600 is just barely ahead of HW23 1600, and they are both ahead of all the other LZ bots, besides the 20-block ones. It's even ahead of lzladder-666, which had 3200 playouts and ladder knowledge.

@alreadydone
Contributor

alreadydone commented Feb 28, 2018

[attached image]
I think in the first line, the first txt is W24 and the second is W23.
(If you don't know what these mean, see #814 and pangafu/Hybrid_LeelaZero.)

@jjoshua2

jjoshua2 commented Feb 28, 2018 via email

@pangafu

pangafu commented Mar 1, 2018

@jjoshua2 Please see #954. My test machine has a 1070 8G.

@pangafu

pangafu commented Mar 1, 2018

@alreadydone Yes, in the first line the first is W24 and the second is W23.

@pangafu

pangafu commented Mar 1, 2018

@jjoshua2 In my tests, the strongest hybrid weights are often not made from the strongest parents; mixing weights with 40%+ win rates often gives the strongest one.
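
For anyone who wants to try this, the core operation is just an element-wise blend of two weight files (a sketch assuming the plain-text leela-zero weights format of one line of numbers per layer after a version line; double-check against the real files):

```python
def blend_weights(path_a, path_b, alpha=0.5, out_path="hybrid.txt"):
    """Element-wise blend of two same-architecture weight files.

    Assumes the plain-text format: a version line followed by one line of
    whitespace-separated floats per layer (verify against the actual files
    before relying on this sketch).
    """
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        for line_a, line_b in zip(fa, fb):
            va, vb = line_a.split(), line_b.split()
            if len(va) == 1 and va == vb:      # version line: copy through
                out.write(line_a)
                continue
            mixed = [alpha * float(a) + (1 - alpha) * float(b)
                     for a, b in zip(va, vb)]
            out.write(" ".join(f"{x:.6g}" for x in mixed) + "\n")
```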

@alreadydone
Contributor

alreadydone commented Mar 1, 2018

@gcp If hybrid weights are stronger due to noise being averaged out, it's probably a good idea to increase the batch size (we are using 512 due to the GPU memory limit, while AGZ and AZ use 2048 and 4096 respectively). I suggested in another thread that we could use openai/gradient-checkpointing (see also the Medium post) to reduce memory usage with a slight increase in training time. A graph shows that peak memory usage is reduced by half for a 6-block ResNet with batch size 1280, so since we could fit batch size 512 into GPU memory before, we should now be able to fit batch size 1024.
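
As an illustration of the trade-off (a conceptual sketch using PyTorch's torch.utils.checkpoint because it is compact; the actual suggestion above is the TensorFlow openai/gradient-checkpointing rewrite): each block's activations are recomputed during the backward pass instead of being stored, cutting peak memory at the cost of extra forward compute.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def tower_forward(blocks, x):
    """Run a residual-style tower, recomputing each block's activations on
    the backward pass instead of keeping them (conceptual sketch only)."""
    for block in blocks:
        x = checkpoint(block, x)   # trades extra forward compute for memory
    return x

blocks = nn.ModuleList(
    nn.Sequential(nn.Conv2d(128, 128, 3, padding=1),
                  nn.BatchNorm2d(128), nn.ReLU()) for _ in range(6))
out = tower_forward(blocks, torch.randn(4, 128, 19, 19, requires_grad=True))
out.sum().backward()
```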

@jkiliani

jkiliani commented Mar 2, 2018

@gcp I think we're ready for the final learning rate reduction for 6x128: considering the regression with a91721af, there has been no significant progress since 92dd0397, and that's already more than 2/3 of a window. Are there any 10-block nets to start testing yet (probably against something other than the best net)?

@jkiliani

jkiliani commented Mar 2, 2018

Since the bootstrap is a full-blown success on the first try, can we simply do the switch to 10 blocks now? The new net looks great!

@dzhurak

dzhurak commented Mar 2, 2018

Very impressive!

@bochen2027

Have we switched to the new net size yet?

@qrss

qrss commented Apr 3, 2018

@Hydrogenpi For information on training progress and the current network size, have a look at this page:
http://zero.sjeng.org/

You can hover over the network names in the Test Matches table and it will show you the network size.
The recent matches are with networks of size 128x10.
