
Is Leelaz trained off all the games, or just the more recent ones? #484

Closed
ywrt opened this issue Dec 24, 2017 · 19 comments

ywrt commented Dec 24, 2017

I'm assuming it's trained off the more recent ones? If so, how many are used?

As an aside: are the scripts used to drive http://zero.sjeng.org/ checked in anywhere? I suspect that would answer all of my questions :)

@OmnipotentEntity

250k games are used in training typically.

ywrt commented Dec 24, 2017

I'm assuming that the script used for training isn't training/tf/parse.py, but something else?
(Because parse.py is CPU-bound parsing the training files, not GPU-bound actually training the network.)

gcp commented Dec 24, 2017

It is parse.py.

The network is fairly small and TensorFlow/Python is rather slow, so it does indeed end up largely CPU-limited.

gcp commented Dec 24, 2017

As an aside: are the scripts used to drive http://zero.sjeng.org/ checked in anywhere? I suspect that would answer all of my questions :)

FWIW the server is @roy7's work and he was interested in open sourcing it, but I think he hasn't gotten around to cleaning up the code yet. It was originally just a few lines of node.js that shoved the submitted data into mongodb, but due to the distributed testing it has gotten more complicated.

The code that dumps the training data is here: https://pastebin.mozilla.org/9075155

Obviously it's useless without the full DB, which I won't publish right now because it contains IP addresses, etc. I might anonymize it after the project.

ywrt commented Dec 26, 2017

A few more questions...

  1. I'm assuming that training continues from the previous network? That is, at some interval new chunks are created from games that have arrived, some old chunks are removed, and then training resumes?

Or alternatively: training is stopped, all chunks are removed, the most recent 250k games are dumped, and training resumes?

  2. What policy loss is currently seen in training?
    (My from-scratch training is currently showing step 157600, policy=2.90912 mse=0.155379 reg=0.181329 total=3.24583 (1320.76 pos/s).)

  3. Why is the batch size set to 256? As far as I can tell, this is too small for the learning rate (as indicated by saturation in the policy loss)?
    I suspect the batch size should be increased to at least 1024 (or alternatively: reduce the learning rate, but a larger batch is preferable). FWIW, the AlphaZero paper mentions a batch size of 4,096: "Training proceeded for 700,000 steps (mini-batches of size 4,096)."

roy7 commented Dec 26, 2017

You're correct on #1, if I understand what you are asking. Training doesn't go on forever though.

Starting with the current best-network, you download the latest games, set up the 250K-game window, then train. Every so many steps a network is exported for match testing. Once the max number of steps is hit (128K now, I believe), download all new games, set up the 250K-window chunks again, and start over from best-network's weight file.

So training is always done with best-network as the starting point, on new data. Multiple step snapshots are tested. Repeat.
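
In code form, roughly (a schematic sketch only: the helper functions and the export interval below are placeholders, not the actual server or training scripts):

    WINDOW_GAMES = 250_000   # sliding window of the most recent self-play games
    MAX_STEPS = 128_000      # training steps per cycle ("128K now I believe")
    TEST_EVERY = 8_000       # illustrative interval for exporting candidate networks

    def run_training_cycle(best_weights, download_games, make_chunks,
                           train_step, export_candidate):
        """One cycle: rebuild the window from scratch, train from best-network,
        and periodically export candidates for match testing."""
        games = download_games(limit=WINDOW_GAMES)   # chunks are rebuilt, none carried over
        chunks = make_chunks(games)
        weights = best_weights                       # always start from best-network
        for step in range(1, MAX_STEPS + 1):
            weights = train_step(weights, chunks)
            if step % TEST_EVERY == 0:
                export_candidate(weights, step)      # goes to gating/match testing
        return weights                               # the gated winner becomes best_weights next cycle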

@evanroberts85

I am pretty sure @gcp said he does not use the learning rates exactly as stated by AlphaGo, but adjusts them for our lower batch size.

gcp commented Dec 26, 2017

Or alternatively: training is stopped, all chunks are removed, the most recent 250k games are dumped, and training resumes?

^^ This. Mostly because it was simpler.

What policy loss is currently seen in training?

step 13900, policy=3.09926 mse=0.164995 reg=0.0880906 total=3.35235 (4411.87 pos/s)

policy loss is very dependent on the training data. With supervised data you should be able to get around 1.5-1.8. For self-play it is gradually dropping as the program gets stronger.
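
For reference, the three terms in that log line are the usual AlphaGo Zero loss components: policy cross-entropy, value MSE, and L2 weight regularization. A minimal TF 1.x-style sketch of how they combine (the placeholders and the 1e-4 coefficient are assumptions; tfprocess.py's exact weighting may differ):

    import tensorflow as tf

    BOARD_MOVES = 19 * 19 + 1   # 361 intersections plus pass

    # Placeholders stand in for the network heads and training targets that
    # tfprocess.py wires up from the parsed chunks.
    policy_logits = tf.placeholder(tf.float32, [None, BOARD_MOVES])  # policy head output
    policy_target = tf.placeholder(tf.float32, [None, BOARD_MOVES])  # MCTS visit distribution
    value_output = tf.placeholder(tf.float32, [None, 1])             # value head, in [-1, 1]
    value_target = tf.placeholder(tf.float32, [None, 1])             # game outcome z

    policy_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=policy_target,
                                                logits=policy_logits))
    mse_loss = tf.reduce_mean(tf.squared_difference(value_target, value_output))

    # L2 regularization over the network weights; 1e-4 is an assumed coefficient.
    reg_vars = tf.trainable_variables()
    reg_loss = (1e-4 * tf.add_n([tf.nn.l2_loss(v) for v in reg_vars])
                if reg_vars else tf.constant(0.0))

    total_loss = policy_loss + mse_loss + reg_loss   # the "total" column in the log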

Why is the batch size set to 256?

It's a default that fits in the RAM of most people's cards. I use 512 and cut the learning rate. Batch size 4096 won't fit on any card for the huge original AlphaGo resnet; they used 64 GPUs x 64 batches or something similar.

Note that the default learning rate in the code is for supervised learning. (I forgot to change this on the training machine at some point...)
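
To illustrate the batch-size/learning-rate relationship, a common rule of thumb (not necessarily the exact adjustment used on the training machine) is to scale the learning rate roughly linearly with the batch size:

    def scaled_learning_rate(base_lr, base_batch, actual_batch):
        """Linear-scaling heuristic: shrink the learning rate in proportion
        to the batch size. A rule of thumb, not the project's exact setting."""
        return base_lr * (actual_batch / float(base_batch))

    # E.g. starting from a hypothetical base rate tuned for AlphaZero's batch of 4,096:
    print(scaled_learning_rate(0.2, 4096, 512))   # -> 0.025
    print(scaled_learning_rate(0.2, 4096, 256))   # -> 0.0125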

gcp commented Dec 26, 2017

Note that the network in use is 5x64, not 6x128 as the code defaults. This obviously also affects the expected loss.

The defaults in the code are what was used for the supervised network in the README.

jjoshua2 commented Dec 26, 2017

I guess if we are running into memory limits we can't make use of the recent paper from Google, "Don't Decay the Learning Rate, Increase the Batch Size". It points out speed increases and a reduced need for hyperparameter tuning. https://arxiv.org/pdf/1711.00489.pdf
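
Schematically, the paper's idea is just to replace each step decay of the learning rate with a step increase of the batch size (the milestones, factor, and base values below are illustrative, not from this project):

    MILESTONES = (100_000, 200_000)   # illustrative step counts
    FACTOR = 10                       # illustrative decay/growth factor

    def lr_decay_schedule(step, base_lr=0.1, batch=256):
        """Conventional schedule: fixed batch size, decay the learning rate."""
        drops = sum(step >= m for m in MILESTONES)
        return base_lr / FACTOR ** drops, batch

    def batch_growth_schedule(step, base_lr=0.1, batch=256):
        """The paper's alternative: fixed learning rate, grow the batch instead
        (which is exactly where GPU memory limits bite)."""
        drops = sum(step >= m for m in MILESTONES)
        return base_lr, batch * FACTOR ** drops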

ywrt commented Dec 26, 2017

I just noticed that the summaries have been uploaded to https://sjeng.org/zero/ which I very much appreciate! Thank you :)

step 13900, policy=3.09926 mse=0.164995 reg=0.0880906 total=3.35235 (4411.87 pos/s)

That's a decent pos/s! May I ask what hardware you are using for that? Or is it based on 16-bit floats?

policy loss is very dependent on the training data. With supervised data you should be able to get around 1.5-1.8.

ack.

It's a default that fits in the RAM of most people's cards. I use 512 and cut the learning rate. Batch size 4096 won't fit on any card for the huge original AlphaGo resnet; they used 64 GPUs x 64 batches or something similar.

That makes sense, thanks.

Note that the network in use is 5x64, not 6x128 as the code defaults. This obviously also affects the expected loss.

Ahh! Everything makes much more sense now! :)

Hmmm. When running with a 5x64 network my GPU isn't saturated any more when training. bummer. Better go make it faster!

ywrt commented Dec 26, 2017

Hmmm. When running with a 5x64 network my GPU isn't saturated any more when training. bummer. Better go make it faster!

Funny story time. I started investigating how I could further improve the speed. Whilst heading down that path, I noticed that the benchmark didn't give the same numbers it did yesterday. Or rather, it did sometimes, but it was very erratic. Much confusion and head-scratching. Clean, rebuild. Pin threads to cores. Make random code changes. Construct voodoo doll. Etc., etc.

Eventually I notice that it's faster after I've been writing code for a while!?! I check the CPU frequency while the benchmark is running and see that it's running at less than half speed. Deep into thermal throttling!

Open machine, look inside, notice that a cable has moved and is now interfering with the CPU fan, stopping it from turning. Move cable. Add cable tie. Benchmarks run twice as fast and are no longer erratic. Run full training. It's GPU-bound, not CPU-bound. All done :)

gcp commented Dec 27, 2017

May I ask what hardware you are using for that?

The training machine is a dedicated Ryzen 1700 with a GTX 1080 Ti on Ubuntu 16.04, with a TF 1.4 compile linked to cuDNN 7.

gcp commented Dec 27, 2017

I guess if we are running into memory limits we can't make use of the recent paper from Google, "Don't Decay the Learning Rate, Increase the Batch Size". It points out speed increases and a reduced need for hyperparameter tuning.

There's very little in this paper that isn't already well known or that hasn't even been discussed at length here.

Larger batch sizes are faster.

Batch size and learning rate are directly related.

PSA: If you read a machine learning paper (there are a shit ton coming out daily) and it claims some improvement (hint: they all do; it's hard to publish a paper if you don't), the odds that it would help this project are about zero, unless the paper is now being quoted by every new paper coming out and has been added as a standard feature in every DL framework.

roy7 commented Dec 27, 2017

@gcp Do you have any other optimizations pulled in, other than just using /next? Your pos/s is twice mine.

ywrt commented Dec 27, 2017

@roy7 I think this is due to using a 5x64 network? E.g. in training/tf/tfprocess.py set:

RESIDUAL_FILTERS = 64   # width of the residual tower (filters per conv layer)
RESIDUAL_BLOCKS = 5     # number of residual blocks (network depth)

Changing that, I go from ~1300 pos/s to ~3300 pos/s on my hardware.

roy7 commented Dec 27, 2017

Oh, maybe I've been training wrong-sized networks all along, lol. I just run the scripts with the defaults as they come from the /next branch. Looks like 6x128 is the default. Ouch!

@jjoshua2

Does that mean that some of the accepted networks were trained with 6x128, even though all the self-play is still 5x64?

ywrt commented Dec 27, 2017

@gcp comments above:

Note that the network in use is 5x64

I assume that means that this is the shape of the network being distributed to self-play clients.
