Is Leelaz trained off all the games, or just the more recent ones? #484

I'm assuming it's trained off the more recent ones? If so, how many are used?

As an aside: are the scripts used to drive http://zero.sjeng.org/ checked in anywhere? I suspect that would answer all of my questions :)

Comments
250k games are used in training typically.
I'm assuming that the script used for training isn't training/tf/parse.py, but something else?
It is parse.py. The network is fairly small and TensorFlow/Python is rather slow, so it ends up largely CPU-limited indeed.
FWIW the server is @roy7's work and he was interested in open sourcing it, but I think he hasn't gotten around to cleaning up the code yet. It was originally just a few lines of node.js that shoved the submitted data into mongodb, but due to the distributed testing it has gotten more complicated. The code that dumps the training data is here: https://pastebin.mozilla.org/9075155 Obviously it's useless without the full DB which I won't publish right now because it has IP addresses etc. I might anonymize it after the project.
A few more questions...

Or alternatively: training is stopped, all chunks are removed, the most recent 250k games are dumped, and training resumes?
You're correct on #1, if I understand what you are asking. Training doesn't go on forever, though. Starting from the current best-network, you download the latest games, set up the 250K-game window, then train. Every so many steps a network is saved out for match testing. Once the maximum number of steps is hit (128K now, I believe), you download all the new games, set up the 250K-game window of chunks again, and start over from best-network's weight file. So training is always done with best-network as the starting point, plus new data. Multiple step files are tested. Repeat.
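To make the cycle a bit more concrete, here is a toy sketch in Python. The helper names, the snapshot interval, and the stand-in "training step" are all made up for illustration; only the overall flow (a 250K-game window, roughly 128K steps per cycle, periodic candidates for match testing, always restarting from the current best network) comes from the description above.

```python
# Toy sketch of the training cycle described above. WINDOW_GAMES and
# MAX_STEPS reflect the numbers mentioned in the thread; everything else
# (helper names, snapshot interval, the fake "network") is a placeholder.

WINDOW_GAMES = 250_000
MAX_STEPS = 128_000
SNAPSHOT_EVERY = 16_000   # assumed interval; the real one may differ

def training_cycle(all_games, best_network):
    window = all_games[-WINDOW_GAMES:]        # most recent 250K games only
    net = dict(best_network)                  # start from best-network weights
    candidates = []
    for step in range(1, MAX_STEPS + 1):
        net["steps_trained"] = net.get("steps_trained", 0) + 1  # stand-in for one SGD step
        if step % SNAPSHOT_EVERY == 0:
            candidates.append((step, dict(net)))  # saved for match testing
    return candidates

# Each candidate is then match-tested against best-network; whichever
# network wins becomes the starting point for the next cycle.
```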
I am pretty sure @gcp said he does not use the learning rates exactly as stated by AlphaGo, but adjusts them for our lower batch size.
^^ This. Mostly because it was simpler.
step 13900, policy=3.09926 mse=0.164995 reg=0.0880906 total=3.35235 (4411.87 pos/s)

Policy loss is very dependent on the training data. With supervised data you should be able to get around 1.5-1.8. For self-play it is gradually dropping as the program gets stronger.
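As a small aside on reading that log line: the three loss terms add up to the reported total, so the total appears to be simply their sum (at least for these numbers).

```python
# The logged "total" matches the sum of the three components in that line.
policy, mse, reg = 3.09926, 0.164995, 0.0880906
print(policy + mse + reg)   # 3.3523456, shown in the log as 3.35235
```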
It's a default that fits in the RAM of most people's cards. I use 512 and cut the learning rate. Batch size 4096 won't fit on any card for the huge original AlphaGo resnet; they used 64 GPUs x 64 batches or something similar. Note that the default learning rate in the code is for supervised learning. (I forgot to change this on the training machine at some point...)
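As an illustration of "I use 512 and cut the learning rate": a common heuristic, assumed here rather than quoted from the actual training setup, is to scale the learning rate linearly with the batch size relative to some reference configuration.

```python
# Linear-scaling heuristic: learning rate proportional to batch size.
# The reference batch size and base rate are placeholders, not the
# project's actual settings.
def scaled_lr(batch_size, ref_batch=2048, ref_lr=0.02):
    return ref_lr * batch_size / ref_batch

print(scaled_lr(512))    # 0.005 -> smaller batch, proportionally smaller rate
print(scaled_lr(4096))   # 0.04  -> what a full-size batch would use
```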
Note that the network in use is 5x64, not the 6x128 the code defaults to. This obviously also affects the expected loss. The defaults in the code are what was used for the supervised network in the README.
I guess if we are running into memory limits we can't make use of the recent paper from Google, "Don't Decay the Learning Rate, Increase the Batch Size". It points out speed increases and a reduced need for hyperparameter tuning. https://arxiv.org/pdf/1711.00489.pdf
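For reference, the idea in that paper is roughly the following (a sketch with made-up numbers, not anything used in this project): wherever a step schedule would divide the learning rate by some factor, multiply the batch size by that factor instead and keep the learning rate fixed, for as long as memory allows.

```python
# Sketch of the paper's schedule swap: grow the batch instead of decaying
# the learning rate. Epoch boundaries and factors are example values only.
lr, batch = 0.1, 256
for epoch in range(1, 91):
    if epoch in (30, 60, 80):    # where an LR decay would normally happen
        batch *= 5               # instead of lr /= 5
    # train_one_epoch(lr, batch) # actual training call omitted in this sketch
print(lr, batch)                 # 0.1 32000 -- far past typical GPU memory
```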
I just noticed that the summaries have been uploaded to https://sjeng.org/zero/ which I very much appreciate! Thank you :)
That's a decent pos/s! May I ask what hardware you are using for that? Or is it based on 16-bit floats?
ack.
That makes sense, thanks.
Ahh! Everything makes much more sense now! :) Hmmm. When running with a 5x64 network my GPU isn't saturated any more when training. Bummer. Better go make it faster!
Funny story time. I started investigating how I could further improve the speed. Whilst heading down that path, I noticed that the benchmark didn't give the same numbers it did yesterday. Or rather, it did sometimes, but was very erratic. Much confusion and head-scratching. Clean, rebuild. Pin threads to cores. Make random code changes. Construct voodoo doll. Etc, etc.

Eventually I notice that it's faster after I've been writing code for a while!?! I check the CPU frequency while the benchmark is running and see that it's running at less than half speed. Deep into thermal throttling! Open machine, look inside, notice that a cable has moved and is now interfering with the CPU fan, stopping it turning. Move cable. Add cable tie.

Benchmarks run twice as fast and are no longer erratic. Run full training. It's GPU-bound, not CPU-bound. All done :)
The training machine is a dedicated Ryzen 1700 with a GTX 1080 Ti on Ubuntu 16.04, with a TF 1.4 compile linked to cuDNN 7.
There's very little in this paper that isn't already well known or hasn't already been discussed at length here. Larger batch sizes are faster. Batch size and learning rate are directly related.

PSA: If you read a machine learning paper (there are a shit ton coming out daily) and it claims some improvement (hint: they all do, it's hard to publish a paper if you don't), the odds that it would help this project are about 0, unless the paper is now being quoted by every new paper coming out and has been added as a standard feature in every DL framework.
@gcp Do you have other optimizations pulled in other than just using /next? Since your pos/sec is twice mine. |
@roy7 I think this is due to using a 5x64 network? E.g., in training/tf/tfprocess.py, set the residual filter and block counts for the smaller network.
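Something along these lines should do it; the variable names here are an assumption and may differ in the version you have checked out:

```python
# Assumed names: shrink the residual tower from the repo's 6x128 default
# to the 5x64 network actually in use.
RESIDUAL_FILTERS = 64
RESIDUAL_BLOCKS = 5
```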
Changing that, I go from ~1300 pos/s to ~3300 pos/s on my hardware.
Oh, maybe I've been training wrong-sized networks all along, lol. I just run the scripts with the defaults as they come from the /next branch. Looks like 6x128 is the default. Ouch!
Does that mean that some of the accepted networks were trained with 6x128, even though all the self-play is still 5x64?
@gcp comments above that the network in use is 5x64, not 6x128. I assume that means that this is the shape of the network being distributed to self-play clients.