Is there any way for us to contribute to the automated training? #234

Closed
isty2e opened this issue Dec 3, 2017 · 16 comments

@isty2e

isty2e commented Dec 3, 2017

Over 80k games have been generated from the current best network, from which we could have trained a few networks, if not many.

I think all of us understand that gcp is the leader of this project and his efforts are entirely voluntary, but I'm afraid we might lose some contributors if things slow down. In particular, many in the community are eager to see results from the recently suggested training methods, as frequently as possible.

Clearly, gcp can be busy for a couple of days, things might happen, or he might not be able to train a new network every 25k games. So I feel it is really time for the training process to be automated (along with distributed testing, which some people are already working on). Unfortunately, the training is performed on the server side, so it is hard for the community to resolve this on our own. Is there any way for us to help automate/pipeline the process?

@earthengine
Contributor

earthengine commented Dec 3, 2017

As a basic security concern, any contributed network has to be certified and trusted by the community. Right now, having a single source of trained networks is how we ensure security.

Potential attacks on, or damage to, the process include:

  • Broken networks - the uploaded network might be broken or incompatible

  • Weak networks - the "best" network is actually weaker than existing networks (it is true that the current process can produce a weaker "best" network, but here "weaker" means it was actually trained on fewer games)

  • Dishonest networks - the network is polluted with human games (which defeats the whole purpose of the project) or other irrelevant sources, or trained with additional constraints that encode domain knowledge beyond the AlphaGo Zero paper

For the first and second issues, a verification step might be introduced to prevent them (I'm not sure about the second), but I don't see any technical way to prevent the last attack. How can we trust anyone who claims to have followed all the rules the project requires?

@l1t1

l1t1 commented Dec 3, 2017

Maybe contributing money to buy more compute capacity from the cloud would be more realistic.

@lwins-lights

lwins-lights commented Dec 3, 2017

@earthengine From a theoretical standpoint we DO have techniques to verify the honesty of a network, via an interactive verification protocol, which is somewhat complicated. But the cost is very high (I estimate it would take up to about 100x the time).

@isty2e
Author

isty2e commented Dec 3, 2017

I can think of a couple of solutions here:

  1. A network is uploaded together with the detailed training method and the training set, so that others can reproduce it.
  2. gcp shares the code he's using for training, along with some server-side scripts or APIs or whatever, so that the community can work on it.

I think that with the suggested methods (especially using fewer training steps) training itself is manageable on a single machine, so I originally posted this issue with option 2 in mind, but if other solutions can work, that would be okay too.

@lwins-lights

lwins-lights commented Dec 3, 2017

@isty2e Great! To do this we only need to specify and verify:
a) the training set used
b) the random bits used, or alternatively the random seed
Also, a similar approach could be applied to address the third issue raised by @earthengine.
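
A minimal sketch of what pinning point (b) could look like in practice, assuming the project's TensorFlow 1.x training scripts; the seed value and the idea of publishing it alongside the network and training set are illustrative assumptions, not existing project policy:

    # Illustrative only: fix every source of randomness so a submitted network
    # could in principle be re-trained bit-for-bit by an independent verifier.
    import random

    import numpy as np
    import tensorflow as tf

    SEED = 20171203            # arbitrary value, published with the training set
    random.seed(SEED)          # Python-level shuffling of training chunks
    np.random.seed(SEED)       # NumPy sampling (e.g. which positions are drawn)
    tf.set_random_seed(SEED)   # graph-level seed (TensorFlow 1.x API)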

@gcp
Member

gcp commented Dec 3, 2017

All the code I use for training is, and has always been, in the source repo, and I've been uploading gigabytes of data in #167 precisely so that others can run the training, which is exactly how the 22373747 network was found!

So yes, obviously you can do this, and other people have already successfully contributed in this way.

@gcp
Member

gcp commented Dec 3, 2017

Verifying a trained network is very easy: if it's a clear gain over the previous one, it takes anyone with a fast machine only a few hours to confirm it with autogtp. (If it's a minor gain, it may take half a day.)
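
For a rough sense of why a clear gain confirms quickly while a minor one needs more games, here is a hedged back-of-the-envelope sketch; the win rates and game counts are made up for illustration, and this is not how autogtp itself decides:

    import math

    def lower_bound(wins, games, z=1.96):
        # 95% lower confidence bound on the true win rate (normal approximation)
        p = wins / games
        return p - z * math.sqrt(p * (1 - p) / games)

    # A clear gain (58% over 400 games) already sits safely above 50%...
    print(lower_bound(232, 400))   # ~0.532
    # ...while a minor gain (53% over 400 games) is still inconclusive.
    print(lower_bound(212, 400))   # ~0.481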

@isty2e
Author

isty2e commented Dec 4, 2017

@gcp Is there any plan to automate everything on the server side? I still think that is the clearest solution.

@bood
Collaborator

bood commented Dec 4, 2017

@gcp I wanted to try the training myself too. I know the script/data are already there, but it appears you've changed some parameters in recent training, e.g. training steps, learning rate, etc.

Are they updated in the script too? I see no recent commits changing these parameters. If not, could you point me to where the training steps are configured?

I can only find the learning rate here: https://github.com/gcp/leela-zero/blob/master/training/tf/tfprocess.py#L80. I don't know how to change the training steps, though.

@gcp
Member

gcp commented Dec 4, 2017

@gcp Is there any plan to automate everything on the server side? I still think that is the clearest solution.

It's being automated on my side (the server doesn't have a GPU or anything). Right now it's a few scripts that I launch and whose output I check (which I can do, e.g., from my phone).

@gcp
Member

gcp commented Dec 4, 2017

it appears you've changed some parameters in recent training e.g. training steps, learning rate etc.

The discussion about how to set the learning rate is in the #78 thread. You should read the AGZ paper and understand how the learning rate corresponds to the batch size. (You won't be able to use the AGZ batch size on most common GPUs.)

I have no idea where you get the stuff about "training steps".
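
A minimal sketch of the linear-scaling rule implied by the learning-rate/batch-size relationship above; the reference learning rate here is a made-up stand-in, not a value actually used by the project:

    # Assumption: learning rate scales roughly linearly with batch size, so a
    # smaller batch on a consumer GPU should use a proportionally smaller rate.
    reference_batch = 2048      # batch size used in the AGZ paper
    reference_lr = 0.05         # illustrative learning rate at that batch size
    local_batch = 256           # what fits on a typical single GPU (assumption)

    local_lr = reference_lr * local_batch / reference_batch
    print(local_lr)             # 0.00625, i.e. 1/8 of the reference rate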

@bood
Collaborator

bood commented Dec 4, 2017

@gcp I'm talking about these:

running 1000 training steps (scaled for minibatch size) and evaluating immediately

I restarted the training from 92c658d weights again with all of the previous training data and trained it again 10k steps

But are they even the same thing...?

@gcp
Member

gcp commented Dec 4, 2017

If you start the training it will run and print e.g.

step 24200, policy loss=4.98316 mse=0.0975871 (644.309 pos/s)
step 24300, policy loss=4.98177 mse=0.0966891 (634.301 pos/s)

So it's just a question of how long you let it run.
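
In other words, the step count is just an external decision about when to stop. A toy sketch of that idea follows; train_one_batch is a hypothetical stand-in, not a function from tfprocess.py:

    import random

    def train_one_batch():
        # Hypothetical stand-in for one optimizer step; returns (policy_loss, mse).
        return 5.0 - random.random() * 0.1, 0.10 - random.random() * 0.01

    TOTAL_STEPS = 8000   # whatever budget you choose; nothing else enforces a limit
    for step in range(1, TOTAL_STEPS + 1):
        policy_loss, mse = train_one_batch()
        if step % 100 == 0:
            print("step %d, policy loss=%.5f mse=%.7f" % (step, policy_loss, mse))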

@bood
Collaborator

bood commented Dec 4, 2017

@gcp Aha, no wonder there's no exit condition in tfprocess.py.

Thanks for clarifying. I'm pretty new to deep learning and TensorFlow, so forgive the dumb questions.

Just to be clear, you now just stop the training when the step count reaches 1000? Not the 10,000 (10k) that others mentioned earlier in #78?

@isty2e
Author

isty2e commented Dec 4, 2017

@bood The number of training steps is scaled according to batch size, so presumably it will be 1000*2048/256 = 8000.
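
A one-line check of that scaling, using the numbers from the thread (1000 steps at a reference batch of 2048, run locally at batch 256):

    paper_steps, paper_batch, local_batch = 1000, 2048, 256
    scaled_steps = paper_steps * paper_batch // local_batch
    print(scaled_steps)   # 8000 -> same total number of positions seen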

@isty2e
Author

isty2e commented Dec 20, 2017

Now networks are trained on a regular if not daily basis, and evaluation is distributed. Closing this issue.

@isty2e isty2e closed this as completed Dec 20, 2017