
Is Leelaz trained off all the games, or just the more recent ones? #484

Closed
ywrt opened this issue Dec 24, 2017 · 19 comments

ywrt commented Dec 24, 2017

I'm assuming it's trained off the more recent ones? If so, how many are used?

As an aside: are the scripts used to drive http://zero.sjeng.org/ checked in anywhere? I suspect that would answer all of my questions :)

@OmnipotentEntity

250k games are used in training typically.

ywrt commented Dec 24, 2017

I'm assuming that the script used for training isn't training/tf/parse.py, but something else?
(Because parse.py is CPU-bound parsing the training files, not GPU-bound actually training the network.)

gcp commented Dec 24, 2017

It is parse.py.

The network is fairly small and TensorFlow/Python is rather slow, so it does indeed end up largely CPU-limited.

gcp commented Dec 24, 2017

As an aside: are the scripts used to drive http://zero.sjeng.org/ checked in anywhere? I suspect that would answer all of my questions :)

FWIW the server is @roy7's work and he was interested in open sourcing it, but I think he hasn't gotten around to cleaning up the code yet. It was originally just a few lines of node.js that shoved the submitted data into mongodb, but due to the distributed testing it has gotten more complicated.

The code that dumps the training data is here: https://pastebin.mozilla.org/9075155

Obviously it's useless without the full DB, which I won't publish right now because it contains IP addresses, etc. I might anonymize it after the project.

ywrt commented Dec 26, 2017

A few more questions...

  1. I'm assuming that training continues from the previous network? That is, at some interval new chunks are created from games that have arrived, some old chunks are removed, and then training resumes?

Or alternatively: training is stopped, all chunks are removed, the most recent 250k games are dumped, and training resumes?

  2. What policy loss is currently seen in training?
    (My from-scratch training is currently showing step 157600, policy=2.90912 mse=0.155379 reg=0.181329 total=3.24583 (1320.76 pos/s).)

  3. Why is the batch size set to 256? As far as I can tell, this is too small for the learning rate (as indicated by saturation in the policy loss)?
    I suspect the batch size should be increased to at least 1024 (or alternatively: reduce the learning rate, but a larger batch is preferable). FWIW, the AlphaZero paper mentions a batch size of 4,096: "Training proceeded for 700,000 steps (mini-batches of size 4,096)."

roy7 commented Dec 26, 2017

You're correct on #1, if I understand what you are asking. Training doesn't go on forever though.

Starting with the current best-network, you download the latest games, set up the 250K-game window, then train. Every so many steps a network is exported for match testing. Once the max number of steps is hit (128K now, I believe), download all new games, set up the 250K-window chunks again, and start over from best-network's weight file.

So training is always done with best-network as the starting point, on new data. Multiple step snapshots are tested. Repeat.
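
In code form, roughly (a schematic sketch only: the helper functions and the export interval below are placeholders, not the actual server or training scripts):

    WINDOW_GAMES = 250_000   # sliding window of the most recent self-play games
    MAX_STEPS = 128_000      # training steps per cycle ("128K now I believe")
    TEST_EVERY = 8_000       # illustrative interval for exporting candidate networks

    def run_training_cycle(best_weights, download_games, make_chunks,
                           train_step, export_candidate):
        """One cycle: rebuild the window from scratch, train from best-network,
        and periodically export candidates for match testing."""
        games = download_games(limit=WINDOW_GAMES)   # chunks are rebuilt, none carried over
        chunks = make_chunks(games)
        weights = best_weights                       # always start from best-network
        for step in range(1, MAX_STEPS + 1):
            weights = train_step(weights, chunks)
            if step % TEST_EVERY == 0:
                export_candidate(weights, step)      # goes to gating/match testing
        return weights                               # the gated winner becomes best_weights next cycle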

@evanroberts85

I am pretty sure @gcp said he does not use the learning rates exactly as stated by AlphaGo, but adjusts them for our lower batch size.

gcp commented Dec 26, 2017

Or alternatively: training is stopped, all chunks are removed, the most recent 250k games are dumped, and training resumes?

^^ This. Mostly because it was simpler.

What policy loss is currently seen in training?

step 13900, policy=3.09926 mse=0.164995 reg=0.0880906 total=3.35235 (4411.87 pos/s)

policy loss is very dependent on the training data. With supervised data you should be able to get around 1.5-1.8. For self-play it is gradually dropping as the program gets stronger.
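
For reference, the three terms in that log line are the usual AlphaGo Zero loss components: policy cross-entropy, value MSE, and L2 weight regularization. A minimal TF 1.x-style sketch of how they combine (the placeholders and the 1e-4 coefficient are assumptions; tfprocess.py's exact weighting may differ):

    import tensorflow as tf

    BOARD_MOVES = 19 * 19 + 1   # 361 intersections plus pass

    # Placeholders stand in for the network heads and training targets that
    # tfprocess.py wires up from the parsed chunks.
    policy_logits = tf.placeholder(tf.float32, [None, BOARD_MOVES])  # policy head output
    policy_target = tf.placeholder(tf.float32, [None, BOARD_MOVES])  # MCTS visit distribution
    value_output = tf.placeholder(tf.float32, [None, 1])             # value head, in [-1, 1]
    value_target = tf.placeholder(tf.float32, [None, 1])             # game outcome z

    policy_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=policy_target,
                                                logits=policy_logits))
    mse_loss = tf.reduce_mean(tf.squared_difference(value_target, value_output))

    # L2 regularization over the network weights; 1e-4 is an assumed coefficient.
    reg_vars = tf.trainable_variables()
    reg_loss = (1e-4 * tf.add_n([tf.nn.l2_loss(v) for v in reg_vars])
                if reg_vars else tf.constant(0.0))

    total_loss = policy_loss + mse_loss + reg_loss   # the "total" column in the log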

Why is the batch size set to 256?

It's a default that fits in the RAM of most people's cards. I use 512 and cut the learning rate. Batch size 4096 won't fit on any card for the huge original AlphaGo resnet; they used 64 GPUs x 64 batches or something similar.

Note that the default learning rate in the code is for supervised learning. (I forgot to change this on the training machine at some point...)
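
To illustrate the batch-size/learning-rate relationship, a common rule of thumb (not necessarily the exact adjustment used on the training machine) is to scale the learning rate roughly linearly with the batch size:

    def scaled_learning_rate(base_lr, base_batch, actual_batch):
        """Linear-scaling heuristic: shrink the learning rate in proportion
        to the batch size. A rule of thumb, not the project's exact setting."""
        return base_lr * (actual_batch / float(base_batch))

    # E.g. starting from a hypothetical base rate tuned for AlphaZero's batch of 4,096:
    print(scaled_learning_rate(0.2, 4096, 512))   # -> 0.025
    print(scaled_learning_rate(0.2, 4096, 256))   # -> 0.0125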

gcp commented Dec 26, 2017

Note that the network in use is 5x64, not 6x128 as the code defaults. This obviously also affects the expected loss.

The defaults in the code are what was used for the supervised network in the README.

jjoshua2 commented Dec 26, 2017

I guess if we are running into memory limits we can't make use of the recent paper from Google, "Don't Decay the Learning Rate, Increase the Batch Size". It points out speed increases and a reduced need for hyperparameter tuning. https://arxiv.org/pdf/1711.00489.pdf
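
Schematically, the paper's idea is just to replace each step decay of the learning rate with a step increase of the batch size (the milestones, factor, and base values below are illustrative, not from this project):

    MILESTONES = (100_000, 200_000)   # illustrative step counts
    FACTOR = 10                       # illustrative decay/growth factor

    def lr_decay_schedule(step, base_lr=0.1, batch=256):
        """Conventional schedule: fixed batch size, decay the learning rate."""
        drops = sum(step >= m for m in MILESTONES)
        return base_lr / FACTOR ** drops, batch

    def batch_growth_schedule(step, base_lr=0.1, batch=256):
        """The paper's alternative: fixed learning rate, grow the batch instead
        (which is exactly where GPU memory limits bite)."""
        drops = sum(step >= m for m in MILESTONES)
        return base_lr, batch * FACTOR ** drops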

ywrt commented Dec 26, 2017

I just noticed that the summaries have been uploaded to https://sjeng.org/zero/ which I very much appreciate! Thank you :)

step 13900, policy=3.09926 mse=0.164995 reg=0.0880906 total=3.35235 (4411.87 pos/s)

That's a decent pos/s! May I ask what hardware you are using for that? Or is it based on 16-bit floats?

policy loss is very dependent on the training data. With supervised data you should be able to get around 1.5-1.8.

ack.

It's a default that fits in the RAM of most people's cards. I use 512 and cut the learning rate. Batch size 4096 won't fit on any card for the huge original AlphaGo resnet; they used 64 GPUs x 64 batches or something similar.

That makes sense, thanks.

Note that the network in use is 5x64, not 6x128 as the code defaults. This obviously also affects the expected loss.

Ahh! Everything makes much more sense now! :)

Hmmm. When running with a 5x64 network my GPU isn't saturated any more when training. bummer. Better go make it faster!

ywrt commented Dec 26, 2017

Hmmm. When running with a 5x64 network my GPU isn't saturated any more when training. bummer. Better go make it faster!

Funny story time. I started investigating how I could further improve the speed. Whilst heading down that path, I noticed that the benchmark didn't give the same numbers it did yesterday. Or rather, it did sometimes, but it was very erratic. Much confusion and head-scratching. Clean, rebuild. Pin threads to cores. Make random code changes. Construct voodoo doll. Etc., etc.

Eventually I notice that it's faster after I've been writing code for a while!?! I check the CPU frequency while the benchmark is running and see that it's running at less than half speed. Deep into thermal throttling!

Open machine, look inside, notice that a cable has moved and is now interfering with the CPU fan, stopping it from turning. Move cable. Add cable tie. Benchmarks run twice as fast and are no longer erratic. Run full training. It's GPU-bound, not CPU-bound. All done :)

gcp commented Dec 27, 2017

May I ask what hardware you are using for that?

The training machine is a dedicated Ryzen 1700 with a GTX 1080 Ti on Ubuntu 16.04, with a TF 1.4 compile linked to cuDNN 7.

gcp commented Dec 27, 2017

I guess if we are running into memory limits we can't make use of the recent paper from Google, "Don't Decay the Learning Rate, Increase the Batch Size". It points out speed increases and a reduced need for hyperparameter tuning.

There's very little in this paper that isn't already well known or that hasn't even been discussed at length here.

Larger batch sizes are faster.

Batch size and learning rate are directly related.

PSA: If you read a machine learning paper (there are a shit ton coming out daily) and it claims some improvement (hint: they all do; it's hard to publish a paper if you don't), the odds that it would help this project are about zero, unless the paper is now being quoted by every new paper coming out and has been added as a standard feature in every DL framework.

roy7 commented Dec 27, 2017

@gcp Do you have any other optimizations pulled in, other than just using /next? Your pos/s is twice mine.

ywrt commented Dec 27, 2017

@roy7 I think this is due to using a 5x64 network? E.g. in training/tf/tfprocess.py set:

RESIDUAL_FILTERS = 64   # width of the residual tower (filters per conv layer)
RESIDUAL_BLOCKS = 5     # number of residual blocks (network depth)

Changing that, I go from ~1300 pos/s to ~3300 pos/s on my hardware.

roy7 commented Dec 27, 2017

Oh, maybe I've been training wrong-sized networks all along, lol. I just run the scripts with the defaults as they come from the /next branch. Looks like 6x128 is the default. Ouch!

@jjoshua2

Does that mean that some of the accepted networks were trained with 6x128, even though all the self-play is still 5x64?

ywrt commented Dec 27, 2017

@gcp comments above:

Note that the network in use is 5x64

I assume that means that this is the shape of the network being distributed to self-play clients.
