Version 0.10 released - Next steps #591
Comments
How does the 128x6 network training work? If we keep the same training pipeline as for 64x5, it needs self-play games for testing. Would the self-plays be done by some contributing clients, or would you do all of them yourself? |
"Once 64x5 has completely stalled, I will see if I can get the 128x6 to beat it." Does it beat it now? :) |
@gcp Thanks! |
I can upload the networks and schedule tests for them, same as it happens for the regular networks. The clients won't really notice, they'll just run a bit slower :-) |
@gcp thanks. |
Thank you for running this project, it's been a delight to follow and contribute wherever possible! About the increase in network size: Is there any good way to test in which cases increasing number of filters helps more, and where the Deepmind approach "Stack more layers" is better? In other words, is there a significant possibility that something like 64x10 might reach similar strength as 128x6? Is there any way other than training supervised nets to find out? |
GCP and everyone involved, thank you very much for all your efforts! This project has been truly fascinating to follow, both as a Go player and as a developer. Looking forward to further experiments. |
Thanks for the computer-only version; I'm generating games in 17 minutes instead of 5 hours XD. What a massive improvement! |
Thank you for running and managing this wonderful project :) |
Plans sound good. Just to be clear: with the 128x6 network, are you moving up by the intervals of the 64x5 "best" networks, using the 250k game window that those networks were trained upon? Also, are you using a set number of steps? I guess this should work, but other approaches are likely to work better. Also, this will not really tell us whether the difference in strength is due to the size of the networks or to the differences in how they are trained. |
Isn't Deepmind's approach stacking both more filters and more layers? AGZ has 256 filters. |
I believe that in general stacking deeper is more attractive for the same (theoretical!) computational effort. You leave more opportunity to develop "higher level" features (or not, when not needed, especially in a resnet where inputs are forwarded!), or possibility for features to spread out their influence spatially. Deeper stacks are harder to train, but ResNets and BN appear to be pretty good at dealing with that. But in terms of computational efficiency, a larger amount of filters tends to behave better, especially on big GPUs, because that part of the computation goes in parallel. The layers need to be processed serially. "In theory" 128 filters are 4 times slower than 64 filters, but in practice, the difference is going to be much smaller. |
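The "in theory 4x slower, in practice much less" argument above can be checked with a back-of-the-envelope FLOP count. The sketch below counts only the 3x3 convolutions of the residual tower on a 19x19 board (input layer and heads ignored); it says nothing about GPU parallelism, which is exactly why the real-world slowdown is smaller than this theoretical ratio.

```python
# Rough theoretical cost of the residual tower only: each residual block has
# two 3x3 convolutions mapping `filters` channels to `filters` channels.
# This deliberately ignores the input convolution and the policy/value heads.
BOARD = 19 * 19

def tower_flops(filters, blocks):
    # multiply-accumulates per 3x3 conv, times 2 for FLOPs
    flops_per_conv = BOARD * (3 * 3 * filters) * filters * 2
    return blocks * 2 * flops_per_conv  # two convs per residual block

ratio = tower_flops(128, 6) / tower_flops(64, 5)
# (128/64)^2 * (12 convs / 10 convs) = 4 * 1.2 = 4.8 in theory;
# on a GPU the extra filters run largely in parallel, so the measured
# slowdown is considerably smaller.
```

The quadratic term in `filters` is what parallelizes well; the linear term in `blocks` is the serial part that cannot be hidden.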
They did 256x20 and 256x40. They did not do 384x20, for example. |
No, I started with a huge window and have been narrowing it to 250k. |
Thanks for the fantastic work! I'm interested in knowing how results for 2200 visits bit were obtained. Also, has anyone trained a supervised network with different depths and filters? |
See the discussion in #546. There's still some work ongoing in this area, and further testing, but it looks promising. The idea is not to spend too much effort in lines that are very forced anyway. |
First I would like to thank @gcp and all contributors for this awesome project and efforts. Now we can prepare for the next run, and here are things I would like to clarify or discuss:
|
But the goal of AZ is to eliminate evaluation matches. If you need to know "not-too-bad", you need evaluation matches, and you could as well go full AGZ. (This is pretty much the opposite of the argument we used to reject switching to the AZ method this run.) Also, shouldn't we try to reproduce AZ exactly, precisely because "we're not sure if it is reproducible"? If we change all kinds of things and then fail (or succeed), we still do not know whether it was because we changed a bunch of stuff or because it's an inherently bad method.

Anyway, before we start with a new larger network: how viable would it be to do one or a few runs with a smaller network, but with some variables adapted? For example, we could use the current games to train a 3x32 network, then run 500k games. After that, train a 3x32 network from scratch and run 1m games to compare. (And would these results carry over to larger networks?)

Another experiment I'd like to see would be to try different window sizes. We could use the current 5x64 network for that: just go back 1m games, train the then-best network with a 100k or 500k window (or possibly two runs with each window size), and then run 500k games or so. We're at 43k games/day now, so experiments like that would take ~2 weeks, but they might give valuable data for our next run, which might take several months. Using a 3x32 network could probably quadruple our game output and only take a couple of days to get meaningful results. |
BN dramatically reduces, if not totally eliminates, the assumed advantages of other fancy activations over ReLU. This is why all those huge CNNs (ResNet-101, ResNet-1201, etc.) prefer trying all kinds of different structures and filter/layer combinations rather than exploiting the seemingly low-hanging fruit of better activation functions. They are not low-hanging fruit because they only offer advantages in some non-general cases and controlled environments. By the way, ReLU is a nonlinear function, and two layers of ReLU could theoretically approximate all continuous functions, just like tanh and sigmoid.
4x2+2 can't detect triple-ko. |
I started [the 128x6 network] with a huge window and have been narrowing it.
You could repeat the same process but with a 64x5 network like the current
one, to see how much of the gain (if there is a gain) comes from the
increase in network size and what effect just changing the training had.
|
@Dorus It is true that there is no evaluation in AZ, but I am not sure that is the purpose. In fact, the motivation for the changes from AGZ to AZ is unclear in the paper. @RavnaBergsndot That is theoretically true to an extent, but in practice it affects performance more or less, usually depending on the nature of the dataset. A simple example would be this. After all, this project aims to be a faithful replication of AGZ, so why not? Also, AFAIK a triple ko consists of 6 moves, so 3x2+2 will do, and we are not adopting superko rules, so is it meaningful to detect a triple ko at all? |
A consistent training procedure with no added variables would be nice for comparing different configurations. I think the 5x64 has quite a bit more potential, but was hamstrung by a rough start. I like the idea of the AZ method of using the latest network. I vote to do a small-scale AZ approach first. |
Most of the experiments in that paper were done without BN. BN forces most inputs into the most interesting part of the ReLU domain, and therefore reduces the need for non-zero outputs when the input is negative. We need more recent experiments. I'm also not convinced that AGZ's "rectifier nonlinearity" means "rectifier plus nonlinearity on its negative domain" rather than just ReLU itself. 3x2+2 won't do, because that "x2" part is for the same turn: "These planes are concatenated together to give input features st = [Xt, Yt, Xt−1, Yt−1, ..., Xt−7, Yt−7, C]." Therefore, for 6 moves we need at least 6x2+2. |
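To make the plane-count arithmetic concrete, here is a rough sketch of how AGZ-style history planes could be assembled. This is an illustration of the encoding quoted from the paper, not the actual leela-zero feature code (which differs in details such as the colour-plane encoding); the array conventions are made up for the example.

```python
import numpy as np

def input_planes(history, to_move, n_history=8, size=19):
    """Sketch of AGZ-style input features: [X_t, Y_t, ..., X_{t-7}, Y_{t-7}, C].

    `history` is a list of board snapshots, most recent last; each snapshot is
    a (size, size) int array with 1 = black stone, -1 = white stone, 0 = empty.
    Illustrative only, not the real leela-zero encoding.
    """
    planes = []
    for i in range(n_history):
        idx = len(history) - 1 - i
        board = history[idx] if idx >= 0 else np.zeros((size, size), dtype=int)
        planes.append((board == to_move).astype(np.float32))   # X: own stones
        planes.append((board == -to_move).astype(np.float32))  # Y: opponent stones
    # C: colour to move (1.0 everywhere for black, 0.0 for white)
    planes.append(np.full((size, size), 1.0 if to_move == 1 else 0.0, np.float32))
    return np.stack(planes)  # shape: (2 * n_history + 1, size, size)
```

With 8 history pairs the network sees the last 8 positions, so a 6-ply triple-ko cycle is visible in the input; with only 3 pairs (the "3x2+2" proposal) it would not be, since each pair covers one position, not one move per player.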
I feel the need for a more general discussion, more beginner friendly and less specifically about |
@RavnaBergsndot Well, but batch normalization was there in the CIFAR-100 benchmark. And if the shift variable is somehow set inappropriately during training, the batchnorm layer can shift the input into the negative region of ReLU ("dying ReLU"), so that is the idea behind all the modified rectifier units. I can hardly imagine any case referring to ReLU as "rectifier nonlinearity", because, you know, ReLU is a rectified linear unit. And you are right about the input features, though I still do not see why we need to detect triple ko in the first place. |
Triple kos matter in rule systems without superko. Actually, they change the result of the game. Without it, they form a (really interesting in a game, I might add) situation where, if neither player is willing to give way, the game cannot end and is declared a draw (actually a "no result", but in situations where no return matches are played, it's effectively the same). Not including enough information for triple-ko detection in the NN would make the network unable to tell the difference between a situation where a move would end the game without a win or a loss, and one where it would not. So even if we aren't interested in superko, it's still a bare minimum to be able to detect triple ko. That being said, it might help a lot with superko detection as well, since gapped repetitions are exceedingly rare in actual play, perhaps sufficiently so that the "damage" of not recognizing those cases without search might not be felt. However, the important thing was to demonstrate why triple-ko detection is needed even if we do not use superko. |
Why 128 filters? 24 blocks * 64 filters should consume about the same time as 6 * 128. I wonder how blocks/filters affect strength... |
64 filters, 24 blocks will almost certainly use more time than 128 filters, 6 blocks. @gcp explained earlier that increasing the number of filters allows more parallelization and is thus usually much less than quadratic in computation time on a GPU. Layers have to be evaluated serially on the other hand. |
I'm 99.9% sure that "rectifier nonlinearity" exactly means ReLU. ReLU is a non-linear unit constructed from a rectifier and a linear unit. A rectified linear unit is a rectifier non-linearity. As was already pointed out, the advantages of "more advanced" activation units disappear when there are BN layers involved, which is why everyone including DeepMind just uses BN+ReLU. |
It's important to make sure the window has enough data or you will get catastrophic over-fitting, especially for the value heads. You can test this yourself. This can't be guaranteed if you introduce a rating cutoff so it's a bad idea. |
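As a sketch of the kind of moving training window being discussed (class name, numbers, and sampling scheme are made up for illustration, not the project's actual pipeline):

```python
import random
from collections import deque

class TrainingWindow:
    """Toy moving window: keep the most recent N games, sample positions from it.

    If N is too small, the value head sees too few distinct game outcomes per
    position and overfits catastrophically, as described above.
    """
    def __init__(self, max_games=250_000):
        self.games = deque(maxlen=max_games)  # old games fall off automatically

    def add_game(self, game):
        # `game` is a list of training positions sharing one outcome label
        self.games.append(game)

    def sample_positions(self, batch_size):
        # one random position per sampled game: all positions of a game share
        # the same winner label, so sampling whole games would correlate the
        # value targets within a batch
        picked = random.choices(self.games, k=batch_size)
        return [random.choice(g) for g in picked]
```

A rating cutoff would shrink the effective window unpredictably, which is the objection raised above: the window size, not just its quality, bounds value-head overfitting.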
Yes, see the match history. We've discussed this before; the scenario just never happened till now. a9 has a better winrate in the first 200 moves and thus a bias in short matches. We start 40-50 matches right away, so the half of them that is relatively short arrives early and gives a biased win %. So far we've had a bunch of 10-1 nets that eventually failed; this is the first 11-1 net that did so. I guess it just got lucky. The only way to prevent situations like this (and also 92dd0397, which promoted at 220-180 and then ended at 230:191 (54.63%)) would be to only accept a pass once all 400 games have been sent out and all ongoing games have returned. This would slow down promotions by 30 minutes to 2 hours, but it would not be too hard to program on the server. Another workaround would be a minimal number of games, like 40, before we accept a SPRT pass. |
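For reference, a minimal SPRT on match results with the proposed minimum-games guard might look like the sketch below. The p0/p1 bounds, error rates, and the 40-game guard are illustrative values, not the server's actual settings, and draws are ignored for simplicity.

```python
import math

def sprt(wins, losses, p0=0.50, p1=0.55, alpha=0.05, beta=0.05, min_games=40):
    """Sequential probability ratio test on a Bernoulli win process.

    Returns 'accept' (promote), 'reject', or 'continue'. Tests H1: p >= p1
    against H0: p <= p0 with error rates alpha/beta. The min_games guard
    prevents a lucky early streak (like 11-1) from passing immediately.
    """
    n = wins + losses
    if n < min_games:
        return 'continue'
    # log-likelihood ratio of H1 vs H0 for the observed win/loss counts
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    if llr >= upper:
        return 'accept'
    if llr <= lower:
        return 'reject'
    return 'continue'
```

With this guard, an 11-1 start returns 'continue' rather than promoting, which is exactly the fix proposed above.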
@Dorus This did happen before, with af4f49f1. A minimum number of games in addition to SPRT pass would be a good fix in my opinion. |
Yes, this is the second time we've had an under-50% promotion, both times causing minor commotions :) By the way, has any thought been given to adopting an intermediary step between the promotion criteria we are using now and always-promote? Say a 50% threshold (or lower)? While the always-promote method clearly works (as does the one we are using now), I wonder if there isn't a sweet spot that balances network diversity vs. quality and gives optimal results? After doing this for a while we could move to always-promote and compare? |
I suspect that always-promote helps new, temporarily crippling discoveries to be retained by the next stronger networks, opening doors to creativity. Of course, on the condition that the window is large enough for the older, stronger nets to still be present. |
Do not forget that we use random noise in training games. It gives enough creativity. I don't expect the always-promote approach would be any faster than the current one. Also, we have the example of the MiniGo guys, who used the always-promote technique, and their results were not any better than ours. P.S. The current methodology works great, so as far as I'm concerned there is no need to change it. But I am biased towards getting the strongest bot, not discovering the best algorithm. |
There is another logical option - if promoted net fails, promote previous best net back. (unless another 3rd net got promoted meanwhile, then I would pass) |
This is my favorite option. It gets new nets running soonest and is closer to the Alpha Zero model. Some variety of games with strong early moves can't hurt that much if it's only a few thousand games.
I was also wondering about a possible bug: for people that don't have internet on weekends, the client keeps playing with the last net. Do those games get added to the front of the training window, or with the net they came from, once they are finally submitted? I'm wondering if some games in the 250k window can be from older nets on odd clients.
|
Still don't see the 256k steps network. |
@gcp So I clearly understand: you want a toggle on the submit-network endpoint you can flag that says "promote this as the new best network immediately"? I'm not sure how we'd change the graphing setup to deal with a best network that has no match history. Maybe another parameter to manually set an Elo score that would override the current way we calculate them all? Is this towards an eventual new "always promote" system, or something we'll use for 10x128? If it's for always-promote, what are our thoughts on how we'll generate Elo figures in that case? Still schedule matches like usual? In an always-promote world, if we still want to run match games, we could set the endpoint to auto-promote after upload and auto-add a match between the new upload and the prior best network at the same time. (Edit: In this case, there would be no more promotion as the result of matches, simply results for Elo graphing.) |
I would propose you just replace the best net every x hours, but for the Elo graph you schedule a match every y hours, where y is a bit larger than x. Like, you get a new net every 4 hours, but you schedule a match every 24 hours. A future bonus would be a more tournament-like setting where you play match games against a bunch of previous nets. Instead of 400 games every 24 hours, you could play 10 games against each of the last 6 nets, resulting in 60 match games per new network, and around 6*60=360 per day. Each net would then play 120 match games: 60 against the nets before it and 60 against the nets after it. However, we would need to borrow some code somewhere to calculate Elo from that data :) I don't know how hard BayesElo is to calculate, but I believe the cgos code is available. |
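As a rough illustration of how Elo could be fitted from such pairwise match data, here is a toy maximum-likelihood fit. It is not BayesElo (no draw model, no prior, no confidence intervals); all names and the gradient-ascent setup are made up for the sketch.

```python
import math

def fit_elo(results, iters=2000, lr=5.0):
    """Toy maximum-likelihood Elo fit from pairwise win/loss counts.

    `results` maps (a, b) -> (wins_a, wins_b). Uses the logistic Elo model
    P(a beats b) = 1 / (1 + 10^((elo_b - elo_a) / 400)) and plain gradient
    ascent on the log-likelihood. Illustrative only, not BayesElo.
    """
    players = {p for pair in results for p in pair}
    elo = {p: 0.0 for p in players}
    for _ in range(iters):
        grad = {p: 0.0 for p in players}
        for (a, b), (wa, wb) in results.items():
            ea = 1.0 / (1.0 + 10 ** ((elo[b] - elo[a]) / 400.0))
            g = wa - (wa + wb) * ea  # observed minus expected wins for a
            grad[a] += g
            grad[b] -= g
        for p in players:
            elo[p] += lr * grad[p]
    # Elo is only defined up to an additive constant; anchor one player at 0
    ref = min(players)
    return {p: elo[p] - elo[ref] for p in players}
```

For example, a 75-25 head-to-head result converges to a gap of about 400*log10(3) ≈ 191 Elo, matching the logistic model's 75% expected score.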
@pangafu what weights are w24, and what GPU are you using? Wondering about how many playouts your unlimited bot on CGOS gets. It's doing remarkably well, almost as good as the 20 block 1600 playouts. |
@jjoshua2 Might be better to compare the 1600 playout versions since more playouts make it a lot stronger. |
They are interesting too. The HW24 1600 is just barely ahead of HW23 1600, and they are both ahead of all the other lz bots, besides the 20 blocks. Even ahead of the lzladder-666 which had 3200 playouts and ladder knowledge. |
|
Wow, a hybrid of 9 nets? I thought the 6 of hybrid 23 was a lot. Once you get that high, there are lots of combinations to test. I'm surprised limiting it to the 3 strongest isn't better, since nets are still progressing.
|
@alreadydone yes, the first is W24, and the second is W23 in first line. |
@jjoshua2 In my tests, the strongest hybrid weight is often not made from the strongest parents; mixes that include 40%+ win-rate weights often produce the strongest one. |
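For anyone who wants to experiment with such hybrids, a naive blend of plain-text weight files might look like the sketch below. Whether simple per-parameter averaging is what these hybrids actually use is my assumption, and the file-format handling (one version line, then lines of space-separated floats of identical shape) is simplified.

```python
def blend_weights(paths, coeffs=None):
    """Blend several plain-text weight files line by line.

    Assumes every file has the same layout: a version line followed by rows of
    space-separated floats with identical shapes. Coefficients default to a
    uniform average. Illustrative sketch, not the actual hybrid tooling.
    """
    weight_sets = []
    for path in paths:
        with open(path) as f:
            lines = f.read().splitlines()
        version = lines[0]
        rows = [[float(x) for x in ln.split()] for ln in lines[1:] if ln.strip()]
        weight_sets.append((version, rows))
    if coeffs is None:
        coeffs = [1.0 / len(paths)] * len(paths)
    first = weight_sets[0][1]
    blended = [
        [sum(c * ws[1][i][j] for c, ws in zip(coeffs, weight_sets))
         for j in range(len(first[i]))]
        for i in range(len(first))
    ]
    return weight_sets[0][0], blended
```

One caveat worth noting: naive averaging only makes sense for checkpoints that are close in training, since distant networks generally do not share a compatible parameterization.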
@gcp If hybrid weights are stronger due to noise being averaged out, it's probably a good idea to increase the batch size (we are using 512 due to the GPU memory limit, while AGZ and AZ are using 2048 and 4096 respectively). I suggested in another thread that we could use openai/gradient-checkpointing (see also the Medium post) to reduce memory usage with a slight increase in training time. A graph shows that peak memory usage is reduced by half for a 6-block ResNet with batch size 1280, so since we could fit batch size 512 into GPU memory before, we should now be able to fit batch size 1024. |
@gcp I think we're ready for the final learning rate reduction of 6x128: Considering the regression with a91721af, there has been no significant progress since 92dd0397, that's already more than 2/3 of a window. Are there any 10 block nets to start testing (probably against other than best net) yet? |
Since the bootstrap is a full blown success on the first try, can we simply do the switch to 10 blocks now? The new net looks great! |
Very impressive! |
Have we switched to the new net size yet? |
@Hydrogenpi For information on training progress and the current network size, have a look at this page: You can hover over the network names in the Test Matches table and it will show you the network size. |
Version 0.10 is released now. If no major bugs surface in the next few days the server will start enforcing this version.
There is this 1500+ post issue where most plans for the future were posted in the past. It's become rather problematic to read, especially on mobile, and mixed with a lot of theories (most not backed with any data or experiments 😉) so I'll post my plans and thoughts for the near future in this issue.
It looks like we're slowly reaching the maximum of what 64x5 is capable of. I will let this run until about 2/3 of the training window is from the same network without improvement, and then drop the learning rate. I expect that's the last time we can do that and (maybe!) see some improvement.
I have been training a 128x6 network starting from the bug-fixed data (i.e. starting around eebb910d) and gradually moving it up to present day. Once 64x5 has completely stalled, I will see if I can get the 128x6 to beat it. If that works out, we can just continue from there and effectively skip the first 6500 Elo and see how much higher we can get (and perhaps do the same with even bigger networks) from continuing the current run.
If that kind of bootstrapping turns out not to work, I'd be interested in doing a new run. My ideas for that right now:
Somewhere between 128x6 and 128x10 sized network. 128x10 would be 8 times slower, but there is a ~2x speed improvement that we could expect to have merged in by then, and the total running time would be around half a year maybe? "Short" enough that people are probably mostly going to stick around. Hopefully also "big" enough that we can see pro level play.
Immediately use new networks for self-play (i.e. according to the latest AZ paper). We see very strong strength see-sawing right now. It is possible that using the new network immediately lets the learning figure out why some of those networks are bad and thus produce faster improvement. It's also possible that this procedure produces no or very slow improvement in our unsynchronized distributed setup and this run ends up being a total failure. But I think we should try to find this out, in the interest of answering the question in case anyone ever tries a "full" 256x20 run on BOINC or an improved version of this project.
Small revision of weights format. There is a redundant bias layer (convolution output before BN) that needs to go, and I want to add a shift layer after the BatchNorm layers. The latter hasn't been generally shown to provide improvements (and I never found any in Go either), but it is computationally almost free and makes the design more generic, so we might as well include it. (Note that scale layers are completely redundant in the AGZ architecture so no point in adding those)
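The reason the pre-BN bias is redundant can be checked numerically: BatchNorm subtracts a stored per-channel mean, so any constant per-channel bias added before it can simply be folded into that mean. A small sketch (variable names are illustrative):

```python
import numpy as np

def fold_conv_bias_into_bn(conv_bias, bn_mean):
    """Absorb a per-channel convolution bias into the BN running mean.

    BN computes (x - mean) / sqrt(var + eps); adding bias b to x and b to the
    stored mean produces the identical output, so the separate bias layer can
    be dropped from the weights format with no change in behavior.
    """
    return bn_mean + conv_bias

# Numerical check: BN(x + b) with mean (m + b) equals BN(x) with mean m.
x = np.random.randn(4, 3)          # 4 samples, 3 channels
b = np.array([0.5, -1.0, 2.0])     # per-channel conv bias
m = np.array([0.1, 0.2, 0.3])      # BN running mean
v = np.array([1.0, 2.0, 0.5])      # BN running variance
eps = 1e-5
out_with_bias = ((x + b) - fold_conv_bias_into_bn(b, m)) / np.sqrt(v + eps)
out_folded = (x - m) / np.sqrt(v + eps)
assert np.allclose(out_with_bias, out_folded)
```

The same subtraction argument is why a scale layer before BN would also be redundant: BN's normalization removes any affine transform applied uniformly per channel.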
There's been a demonstration that instead of stopping at 1600 playouts per move, it may be more computationally efficient to stop at 2200 "visits" per move. So we should do that.
Thanks to all who have contributed computation power and code contributions to the project so far. We've validated that the AlphaGo approach is reproducible in a distributed setting - even if only on smaller scale - and made a dan player appear out of thin air.
Some personal words:
I have been very, very happy with the quality and extent of code contributions so far. It seems that many of you have found the codebase approachable enough to make major enhancements, or use it as a base for further learning or other experiments about Go or machine learning. I could not have hoped for a more positive outcome in that regard. My initial estimate was that 10-50 people would run the client, maybe one person would submit build fixes, and that would be it. Clearly, I was off by an order of magnitude, and I'm spending much more time than foreseen on doing things like reviewing pull requests etc. So please have some patience in that regard - I will keep trying to do those thoroughly.
For the people who have a lot of ideas and like to argue: convincing, actionable data (or even better, code that can be tested for effectiveness) will make my opinion flip-flop like the best/worst politician, whereas arguing with words only is likely to be as fun and effective as slamming your head against a wall repeatedly.
Miscellaneous:
I am very interested in any ideas or contributions that make me more redundant for this project. I have some ideas of my own that I want to test. My wife would also like to see me again!
The training and server portions run fully automatically now for the most part (cough @roy7), although some other things, like uploading training data, have proven problematic to automate, so that won't be live for the foreseeable future either.
There's been a lot of concern about bad actors, vandalism, broken clients, etc., but so far the learning seems to be simply robust against this. There is now some ability to start filtering bad training data, but it remains tricky to make this solid without producing too many false positives. I'd advise only worrying when there are actual problems.