
Train value head against q at root after search and game result #84

Closed
Videodr0me opened this issue Jun 14, 2018 · 26 comments

Comments

@Videodr0me
Collaborator

Just an idea: why not train the value head not only against the game result, but against the root q after search, averaged with the game result?

Shouldn't this somewhat alleviate the problems of value-head learning? Or has this already been tested?

@Tilps
Contributor

Tilps commented Jun 14, 2018

I think for a strategy like that to be sound, you would need a confidence interval, and weight the training depending on the confidence interval width or something. Would probably need some fancy math to decide exactly how much.

@Videodr0me
Collaborator Author

Maybe just use a formula that weights the root q after search by the number of moves until the game was decided, and averages that with the actual game result. A threshold could also be used so that the actual game result is always weighted more heavily than root q. This of course needs some thought, but it seems like a sensible thing to try...

@Tilps
Contributor

Tilps commented Jun 29, 2018

While root Q is just an average over visits, I'm not sold on giving it too much weight. It would probably be better to use the Q of the most-visited node instead - and even then I'd be much happier if that Q came from a more advanced technique than averaging.

@jjoshua2
Contributor

jjoshua2 commented Jun 29, 2018 via email

@Videodr0me
Collaborator Author

@Tilps: Hmm, not sure what you mean by "average of visits". Root q (after searching for a move to play) is an average of all NN evals and terminal values (-1, 0, +1) of the visited nodes in the search tree, and should be (and is) a predictor of the game result.

But I agree that any kind of weighted average, be it by distance to game end or over all root q values in a game (jjoshua2's idea), is an approximation. Still, I would advocate first trying something simple before trying to make it theoretically bulletproof.

Besides the actual formula for combining q's with game results, there are two main schemes:

  1. Compute a "meta-q" from all searches that occurred in the whole game, combine that with the actual game result and use that for learning (this seems to me to be jjoshua2's approach).

  2. Use each q individually and combine it with the game result (maybe weighted by distance to game end).

The first scheme induces more correlation but raises accuracy, while the second has less correlation but loses accuracy. It would be very interesting to test these schemes, because if anything the value head currently has a lot less variance to train on, and this might be a possible solution.

@asymptomatic-tomato

asymptomatic-tomato commented Jun 29, 2018

I think for testing, it's better to start off simple. My first thought is to train against (N * z + (L - N) * q) / L, where z represents the game result, and N and L represent the current move number and game length respectively. This is basically the same thing as your (2).

EDIT: It's worth noting that testing this would require a new training run, since the current format for training data does not include q-values.
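For concreteness, here is a minimal numpy sketch of that blend; the function and parameter names are illustrative assumptions, not anything that exists in the training code:

```python
import numpy as np

def blended_target(z, q, move_number, game_length):
    """Blend of search q and game result z: (N*z + (L-N)*q) / L.

    move_number runs from 0 at the first move up to game_length at the
    final position, so the weight on z grows linearly over the game.
    z, q and move_number are assumed to already share one perspective.
    """
    n = np.asarray(move_number, dtype=float)
    q = np.asarray(q, dtype=float)
    L = float(game_length)
    return (n * z + (L - n) * q) / L
```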

@Videodr0me
Collaborator Author

Oracle just released this a couple of days ago, interesting read - seems like they tried this idea with success.
https://medium.com/oracledevs/lessons-from-alphazero-part-4-improving-the-training-target-6efba2e71628

@dubslow
Member

dubslow commented Jul 1, 2018

That's what convinced me that this may not be a bad idea.

In general I'm against putting too much weight on the search results, since the search results of course depend on what the value and policy have already been trained to be. Training policy against value is fine in this manner, IMO, because value is not trained against anything self-referential (it is trained on the game outcome); so even though policy is self-referential with respect to value, through value it is indirectly non-self-referential. Training value against the search-result value does eliminate that non-self-referentiality, which is philosophically a big problem in my opinion.

Of course it is absolutely true that the game outcome, while not self-referential, is extremely noisy, so it's good to see that introducing the far-less-noisy search result into training, while removing non-self-referentiality, smooths the noise enough to improve training.

Between this and the fact that it would seriously simplify resign handling, I'm thinking that we should just bite the bullet and include each q in the training data now. @Tilps, thoughts on at least this last part?

@Videodr0me
Collaborator Author

Completely agree: q should be included. The additional space is negligible. If and how we use it is another matter and should be deliberated carefully. For example, we could put at least 50% of the weight on the actual game result and split the rest between root q and the actual game result, weighted by distance to game end. So the move right before the game finishes is trained 100% on the actual game result, while the first move of the game is trained 50% on the actual game result and 50% on root q. The schemes from the Oracle article are also worth a look.
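A rough sketch of that weighting, with a floor of 50% on the game result (names and the linear ramp are illustrative assumptions, not an agreed-upon scheme):

```python
import numpy as np

def target_with_result_floor(z, q, move_number, game_length, floor=0.5):
    """Blend in which the game result z always keeps at least `floor`
    of the weight. The weight on z ramps linearly from `floor` at the
    first move to 1.0 at the last move; the remainder goes to root q.
    """
    n = np.asarray(move_number, dtype=float)
    q = np.asarray(q, dtype=float)
    frac = n / max(game_length - 1, 1)   # 0.0 at the first move, 1.0 at the last
    w_z = floor + (1.0 - floor) * frac
    return w_z * z + (1.0 - w_z) * q
```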

@dubslow
Member

dubslow commented Jul 1, 2018

I like the shifting-proportion-per-move idea, but only if the initial game result weight is non-zero, as you suggest. I'd be on board to try that.

@Tilps
Contributor

Tilps commented Jul 2, 2018

Before we put Q in the training data, we should decide what Q to put in there.
The article supporting (Q+Z)/2 as the training target contains a few concerning oversights. I think it's reasonable to expect it would train faster, but convincing me that it doesn't cap out lower looks a more difficult prospect.
(The oversights are not noticing that A0 uses temperature until the last move, and thinking that the Q of the root node represents the expected value of the position. The Q of the root node represents an approximation of the expected value of the position, but only really in the sense of the game being played out with temperature = 1, which is in turn contradictory to their thought that turning off temperature part way through is essential... It's interesting that using Q like this probably dampens some of the learning-speed improvements of resign - since resigned positions now train to a target closer to their resign value rather than always to the loss score.)
If we're going to use Q to actually learn 'better' rather than just maybe faster, simply averaging root Q is probably not the ideal option. A confidence-interval-derived Q like the one LeelaGoZero was experimenting with (but which I lost track of) seems a much better training target.
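For reference, one common confidence-bound formulation along these lines subtracts a multiple of the standard error from the averaged Q; this is only an illustrative sketch under that assumption, not the scheme LeelaGoZero actually tested:

```python
import math

def lcb_q(q_mean, q_var, visits, c=1.0):
    """Lower-confidence-bound style Q: the running average minus a
    multiple of its standard error. With fewer than two visits there is
    no variance estimate, so the plain average is returned."""
    if visits < 2:
        return q_mean
    return q_mean - c * math.sqrt(q_var / visits)
```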

@Tilps
Contributor

Tilps commented Jul 2, 2018

Another thought - people thinking of changing the Q-to-Z ratio linearly from game start to game end should probably consider that Z is probably much more effective both at the start and close to the end. The end is obvious enough; at the start, Z gains strength because there is a small number of start positions compared to games played, so there are lots of Z samples per position to create a good average. On the other hand, Q's limited window in the early game probably lowers its value.

@Videodr0me
Collaborator Author

Videodr0me commented Jul 2, 2018

@Tilps Good points. Before addressing the main points, there is one sentence I am confused about: "Before we put Q in the training data, we should decide what Q to put in there." I would think it's natural to augment the data generated from playing training games with the root q for every move of all games. I would have thought this gives us full flexibility in how we later combine it with Z for actual learning. So I would not consider this a major issue.

Now to the main points.

(1) If we decide we need a confidence interval, that would be very easy to do, as I have already coded measuring the empirical variance (which should be better than any theoretically derived measure) for each node's q. It's only a few lines of code and one additional number to output. I do not yet have an opinion on whether such a confidence interval is necessary or not.

(2) The question of whether such schemes only learn faster but, as a potential downside, might max out earlier can only be answered empirically. What else is there to do but try it?

(3) Regarding temperature: much of what you said could also be said about Z, as temperature also somewhat decorrelates the game outcome from positions (especially early in the game). But I agree this needs some thought.

(4) Yes, simple averaging is not the best. But that's exactly why we are discussing it here.

(5) A double yes to your observation that it should probably not be linear (in moves to the game's conclusion) at very early game positions. This brings some sort of quadratic regression to mind.

Proposition: Let's say we add the q (at root) of every search to the training data. Then we could do a regression (with quadratic terms) of Z on the predictors root-q and moves_till_game_ends (and maybe other things we come up with, like confidence bounds). This would confirm or disconfirm some of the questions raised and directly imply a function to use for combining q and Z.
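A small sketch of such a regression (the quadratic and interaction feature set is an illustrative assumption):

```python
import numpy as np

def fit_blend_regression(z, q, moves_left):
    """Least-squares fit of the game result z on root q and
    moves-until-game-end, with quadratic and interaction terms.
    The fitted coefficients suggest how q and Z could be combined."""
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    m = np.asarray(moves_left, dtype=float)
    X = np.column_stack([np.ones_like(q), q, m, q * m, q ** 2, m ** 2])
    coeffs, *_ = np.linalg.lstsq(X, z, rcond=None)
    return coeffs  # a position's predicted target is its feature row dotted with coeffs
```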

@Tilps
Contributor

Tilps commented Jul 2, 2018

Part of the reason why it matters what we put there is that the format is not easily changed, and if we go down the confidence-interval route, there is probably value in sending the confidence interval itself rather than just its centre.
The other part is that changing the meaning of a field is a version-raising event just like adding a new field would be, so it's not 'free'.

@killerducky
Contributor

I think we should not rush to add something now. It feels like feature creep, and we don't know the best value to put there. I also think that after lc0 is finally released there are other tests we can do, such as removing history, which could also change the training data format.

When lc0 is finally released we can use the test pipeline for experiments like these.

@Ishinoshita

FYI, using (Q+z)/2 as the value-head target was also proposed in LZ (see for example
leela-zero/leela-zero#1480 (comment)), but gcp seemed pretty convinced it could only slow down learning.
Regarding the Oracle re-implementation, I note that they only experimented with Connect-4, which may not reflect the long-delayed-reward character of the game of Go. So gcp's comment may still stand for Go.

@mooskagh
Member

mooskagh commented Jan 1, 2019

I think similar things were tested a few times and they didn't bring improvements. Closing this for now as it's a stale issue.

@mooskagh mooskagh closed this as completed Jan 1, 2019
@oscardssmith
Contributor

I'd reopen. The testing showed improvement.

@mooskagh mooskagh reopened this Jan 1, 2019
@Ishinoshita

@oscardssmith

The testing showed improvement.

Do you refer to the Oracle result again, or to something else?

@oscardssmith
Contributor

IIRC jio or someone did a CCRL test that showed good things.

@Ishinoshita

@oscardssmith This would have implied not just a CCRL test but, before that test, modifying the training data format, generating a sufficient volume of self-play games and training a few generations of networks...

If I understand correctly and this was done by someone for chess (and not just Connect-4), then I completely missed that information.

@Ttl
Member

Ttl commented Jan 3, 2019

@Cyanogenoid tried this. He modified the training code and played a lot of self-play games. I remember that it had good results, but I can't find them in Discord.

Code is in his repo: https://github.com/Cyanogenoid/lczero-training/tree/q and https://github.com/Cyanogenoid/lc0/tree/q

@Ishinoshita

@Ttl Thanks a lot for the links!

@lp200

lp200 commented Jan 17, 2019

@Ttl
I think that training on an average of Q and Z is worth testing. That method worked very well for shogi; it was much stronger than Z-only training.

This is an engine used in the AZ paper:
https://github.com/yaneurao/YaneuraOu/blob/master/source/learn/learner.cpp#L1133

and an AZ-like engine:
https://github.com/TadaoYamaoka/DeepLearningShogi/blob/master/dlshogi/train_rl_policy_with_value_using_hcpe_bootstrap.py#L74

@killerducky
Contributor

lightvector (KataGo author):

Some brief results - for my bot at least (still having ownership targets, etc) switching an ongoing training run near the strength of LZ110 to use 0.8 z + 0.2 (ewma of future q with halflife 12) has almost no effect compared to not switching it. Switching it to use 0.5 z + 0.5 (ewma of future q with halflife 12) ended up slightly worse, reaching perhaps 25 Elo lower after the same amount of time - which was about 50M more training samples, perhaps a bit less than 1M games.
(I forked the same training run 3 ways, pausing it and copying all the data into separate runs, then restarting those runs from the stopped point but with the difference of training target)
Uncertainty bounds on the 25 Elo are a bit large: 1 stdev is about 12, so 2 stdevs would give an interval of roughly (0, 50) -
but it's clear it's either worse or the same, probably not better.
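For anyone wanting to reproduce that target, here is a small sketch of an exponentially weighted average of current-and-future q with a given halflife, followed by the quoted 0.8/0.2 blend (function name and perspective handling are assumptions):

```python
import numpy as np

def future_q_ewma(q, halflife=12.0):
    """Exponentially weighted average of the current and all future
    per-move q values, with the weight halving every `halflife` moves.
    q is ordered from the first to the last move of the game and is
    assumed to already be expressed from one fixed perspective."""
    d = 0.5 ** (1.0 / halflife)           # per-move decay factor
    out = np.empty(len(q), dtype=float)
    num = 0.0                             # running weighted sum
    den = 0.0                             # running total weight
    for t in range(len(q) - 1, -1, -1):   # walk backwards from the last move
        num = q[t] + d * num
        den = 1.0 + d * den
        out[t] = num / den
    return out

# the quoted milder mix: 0.8 * z + 0.2 * (ewma of future q with halflife 12)
# target = 0.8 * z + 0.2 * future_q_ewma(q, halflife=12.0)
```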

@mooskagh
Member

I think this is implemented?
