
Train value head against q at root after search and game result #84

Closed
Videodr0me opened this issue Jun 14, 2018 · 26 comments

Comments

@Videodr0me
Collaborator

Just an idea: why not train the value head not only against the game result, but against the root q after search, averaged with the game result?

Shouldn't this somewhat alleviate the problems of value-head learning? Or has this already been tested?

@Tilps
Contributor

Tilps commented Jun 14, 2018

I think for a strategy like that to be sound, you would need a confidence interval, and weight the training depending on the confidence interval width or something. Would probably need some fancy math to decide exactly how much.

@Videodr0me
Collaborator Author

Maybe just use a formula that weights the root q after search by the number of moves until the game was decided, and averages that with the actual game result. A threshold could also be used so that the actual game result is always weighted more heavily than root q. This of course needs some thought, but it seems like a sensible thing to try...

@Tilps
Contributor

Tilps commented Jun 29, 2018

While root Q is just an average over visits, I'm not sold on giving it too much weight. It would probably be better to use the Q of the most-visited node instead - and even then I'd be much happier if that Q came from a more advanced technique than averaging.

@jjoshua2
Contributor

jjoshua2 commented Jun 29, 2018 via email

@Videodr0me
Collaborator Author

@Tilps: Hmm, not sure what you mean by "average of visits". Root q (after searching for a move to play) is an average of all NN evals and terminal values (-1, 0, +1) of the visited nodes in the search tree, and should be (and is) a predictor of the game result.

But I agree that any kind of weighted average, be it by distance to game end or over all root q values in a game (jjoshua2's idea), is an approximation. Still, I would advocate first trying something simple before trying to make it theoretically bulletproof.

Besides the actual formula for combining q's with game results, there are two main schemes:

  1. Compute a "meta-q" from all searches that occurred in the whole game, combine that with the actual game result and use that for learning (this seems to me to be jjoshua2's approach).

  2. Use each q individually and combine it with the game result (maybe weighted by distance to game end).

The first scheme induces more correlation but raises accuracy, while the second has less correlation but loses accuracy. It would be very interesting to test these schemes, because if anything the value head currently has a lot less variance to train on, and this might be a possible solution.

@asymptomatic-tomato

asymptomatic-tomato commented Jun 29, 2018

I think for testing, it's better to start off simple. My first thought is to train against (N * z + (L - N) * q) / L, where z represents the game result, and N and L represent the current move number and game length respectively. This is basically the same thing as your (2).

EDIT: It's worth noting that testing this would require a new training run, since the current format for training data does not include q-values.
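For concreteness, here is a minimal numpy sketch of that blend; the function and parameter names are illustrative assumptions, not anything that exists in the training code:

```python
import numpy as np

def blended_target(z, q, move_number, game_length):
    """Blend of search q and game result z: (N*z + (L-N)*q) / L.

    move_number runs from 0 at the first move up to game_length at the
    final position, so the weight on z grows linearly over the game.
    z, q and move_number are assumed to already share one perspective.
    """
    n = np.asarray(move_number, dtype=float)
    q = np.asarray(q, dtype=float)
    L = float(game_length)
    return (n * z + (L - n) * q) / L
```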

@Videodr0me
Collaborator Author

Oracle just released this a couple of days ago, interesting read - seems like they tried this idea with success.
https://medium.com/oracledevs/lessons-from-alphazero-part-4-improving-the-training-target-6efba2e71628

@dubslow
Member

dubslow commented Jul 1, 2018

That's what convinced me that this may not be a bad idea.

In general I'm against putting too much weight on the search results, since the search results of course depend on what the value and policy have already been trained to be. Training policy against value is fine in this manner, IMO, because value is not trained against anything self-referential (it is trained on the game outcome); so even though policy is self-referential with respect to value, through value it is indirectly non-self-referential. Training value against the search-result value does eliminate that non-self-referentiality, which is philosophically a big problem in my opinion.

Of course it is absolutely true that the game outcome, while not self-referential, is extremely noisy, so it's good to see that introducing the far-less-noisy search result into training, while removing non-self-referentiality, smooths the noise enough to improve training.

Between this and the fact that it would seriously simplify resign handling, I'm thinking that we should just bite the bullet and include each q in the training data now. @Tilps, thoughts on at least this last part?

@Videodr0me
Collaborator Author

Completely agree: q should be included. The additional space is negligible. If and how we use it is another matter and should be deliberated carefully. For example, we could put at least 50% of the weight on the actual game result and split the rest between root q and the actual game result, weighted by distance to game end. So the move right before the game finishes is trained 100% on the actual game result, while the first move of the game is trained 50% on the actual game result and 50% on root q. The schemes from the Oracle article are also worth a look.
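A rough sketch of that weighting, with a floor of 50% on the game result (names and the linear ramp are illustrative assumptions, not an agreed-upon scheme):

```python
import numpy as np

def target_with_result_floor(z, q, move_number, game_length, floor=0.5):
    """Blend in which the game result z always keeps at least `floor`
    of the weight. The weight on z ramps linearly from `floor` at the
    first move to 1.0 at the last move; the remainder goes to root q.
    """
    n = np.asarray(move_number, dtype=float)
    q = np.asarray(q, dtype=float)
    frac = n / max(game_length - 1, 1)   # 0.0 at the first move, 1.0 at the last
    w_z = floor + (1.0 - floor) * frac
    return w_z * z + (1.0 - w_z) * q
```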

@dubslow
Member

dubslow commented Jul 1, 2018

I like the shifting-proportion-per-move idea, but only if the initial game result weight is non-zero, as you suggest. I'd be on board to try that.

@Tilps
Contributor

Tilps commented Jul 2, 2018

Before we put Q in the training data, we should decide what Q to put in there.
The article supporting (Q+Z)/2 as the training target contains a few concerning oversights. I think it's reasonable to expect it would train faster, but convincing me that it doesn't cap out lower looks a more difficult prospect.
(The oversights are not noticing that A0 uses temperature until the last move, and thinking that the Q of the root node represents the expected value of the position. The Q of the root node represents an approximation of the expected value of the position, but only really in the sense of the game being played out with temperature = 1, which is in turn contradictory to their thought that turning off temperature part way through is essential... It's interesting that using Q like this probably dampens some of the learning-speed improvements of resign - since resigned positions now train to a target closer to their resign value rather than always to the loss score.)
If we're going to use Q to actually learn 'better' rather than just maybe faster, simply averaging root Q is probably not the ideal option. A confidence-interval-derived Q like the one LeelaGoZero was experimenting with (but which I lost track of) seems a much better training target.
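For reference, one common confidence-bound formulation along these lines subtracts a multiple of the standard error from the averaged Q; this is only an illustrative sketch under that assumption, not the scheme LeelaGoZero actually tested:

```python
import math

def lcb_q(q_mean, q_var, visits, c=1.0):
    """Lower-confidence-bound style Q: the running average minus a
    multiple of its standard error. With fewer than two visits there is
    no variance estimate, so the plain average is returned."""
    if visits < 2:
        return q_mean
    return q_mean - c * math.sqrt(q_var / visits)
```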

@Tilps
Contributor

Tilps commented Jul 2, 2018

Another thought - people thinking of changing the Q-to-Z ratio linearly from game start to game end should probably consider that Z is probably much more effective both at the start and close to the end. The end is obvious enough; at the start, Z gains strength because there is a small number of start positions compared to games played, so there are lots of Z samples per position to create a good average. On the other hand, Q's limited window in the early game probably lowers its value.

@Videodr0me
Collaborator Author

Videodr0me commented Jul 2, 2018

@Tilps Good points. Before addressing the main points, there is one sentence I am confused about: "Before we put Q in the training data, we should decide what Q to put in there." I would think it's natural to augment the data generated from playing training games with the root q for every move of all games. I would have thought this gives us full flexibility in how we later combine it with Z for actual learning. So I would not consider this a major issue.

Now to the main points.

(1) If we decide we need a confidence interval, that would be very easy to do, as I have already coded measuring the empirical variance (which should be better than any theoretically derived measure) for each node's q. It's only a few lines of code and one additional number to output. I do not yet have an opinion on whether such a confidence interval is necessary or not.

(2) The question of whether such schemes only learn faster but, as a potential downside, might max out earlier can only be answered empirically. What else is there to do but try it?

(3) Regarding temperature: much of what you said could also be said about Z, as temperature also somewhat decorrelates the game outcome from positions (especially early in the game). But I agree this needs some thought.

(4) Yes, simple averaging is not the best. But that's exactly why we are discussing it here.

(5) A double yes to your observation that it should probably not be linear (in moves to the game's conclusion) at very early game positions. This brings some sort of quadratic regression to mind.

Proposition: Let's say we add the q (at root) of every search to the training data. Then we could do a regression (with quadratic terms) of Z on the predictors root-q and moves_till_game_ends (and maybe other things we come up with, like confidence bounds). This would confirm or disconfirm some of the questions raised and directly imply a function to use for combining q and Z.
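A small sketch of such a regression (the quadratic and interaction feature set is an illustrative assumption):

```python
import numpy as np

def fit_blend_regression(z, q, moves_left):
    """Least-squares fit of the game result z on root q and
    moves-until-game-end, with quadratic and interaction terms.
    The fitted coefficients suggest how q and Z could be combined."""
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    m = np.asarray(moves_left, dtype=float)
    X = np.column_stack([np.ones_like(q), q, m, q * m, q ** 2, m ** 2])
    coeffs, *_ = np.linalg.lstsq(X, z, rcond=None)
    return coeffs  # a position's predicted target is its feature row dotted with coeffs
```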

@Tilps
Contributor

Tilps commented Jul 2, 2018

Part of the reason why it matters what we put there is that the format is not easily changed, and if we go down the confidence-interval route, there is probably value in sending the confidence interval itself rather than just its centre.
The other part is that changing the meaning of a field is a version-raising event just like adding a new field would be, so it's not 'free'.

@killerducky
Contributor

I think we should not rush to add something now. It feels like feature creep, and we don't know the best value to put there. I also think that after lc0 is finally released there are other tests we can do, such as removing history, which could also change the training data format.

When lc0 is finally released we can use the test pipeline for experiments like these.

@Ishinoshita

FYI, using (Q+z)/2 as the value-head target was also proposed in LZ (see for example
leela-zero/leela-zero#1480 (comment)), but gcp seemed pretty convinced it could only slow down learning.
Regarding the Oracle re-implementation, I note that they only experimented with Connect-4, which may not reflect the long-delayed-reward character of the game of Go. So gcp's comment may still stand for Go.

@mooskagh
Member

mooskagh commented Jan 1, 2019

I think similar things were tested a few times and they didn't bring improvements. Closing this for now as it's a stale issue.

@mooskagh mooskagh closed this as completed Jan 1, 2019
@oscardssmith
Contributor

I'd reopen. The testing showed improvement.

@mooskagh mooskagh reopened this Jan 1, 2019
@Ishinoshita

@oscardssmith

The testing showed improvement.

Do you refer to the Oracle result again, or to something else?

@oscardssmith
Contributor

IIRC jio or someone did a CCRL test that showed good things.

@Ishinoshita

@oscardssmith This would have implied not just a CCRL test but, before that test, modifying the training data format, generating a sufficient volume of self-play games and training a few generations of networks...

If I understand correctly and this was done by someone for chess (and not just Connect-4), then I completely missed that information.

@Ttl
Member

Ttl commented Jan 3, 2019

@Cyanogenoid tried this. He modified the training code and played a lot of self-play games. I remember that it had good results, but I can't find them in Discord.

Code is in his repo: https://github.com/Cyanogenoid/lczero-training/tree/q and https://github.com/Cyanogenoid/lc0/tree/q

@Ishinoshita

@Ttl Thanks a lot for the links!

@lp200

lp200 commented Jan 17, 2019

@Ttl
I think that training on an average of Q and Z is worth testing. That method worked very well for shogi; it was much stronger than Z-only training.

This is an engine used in the AZ paper:
https://github.com/yaneurao/YaneuraOu/blob/master/source/learn/learner.cpp#L1133

and an AZ-like engine:
https://github.com/TadaoYamaoka/DeepLearningShogi/blob/master/dlshogi/train_rl_policy_with_value_using_hcpe_bootstrap.py#L74

@killerducky
Contributor

lightvector (KataGo author):

Some brief results - for my bot at least (still having ownership targets, etc) switching an ongoing training run near the strength of LZ110 to use 0.8 z + 0.2 (ewma of future q with halflife 12) has almost no effect compared to not switching it. Switching it to use 0.5 z + 0.5 (ewma of future q with halflife 12) ended up slightly worse, reaching perhaps 25 Elo lower after the same amount of time - which was about 50M more training samples, perhaps a bit less than 1M games.
(I forked the same training run 3 ways, pausing it and copying all the data into separate runs, then restarting those runs from the stopped point but with the difference of training target)
Uncertainty bounds on the 25 Elo are a bit large: 1 stdev is about 12, so 2 stdevs would give an interval of roughly (0, 50) -
but it's clear it's either worse or the same, probably not better.
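For anyone wanting to reproduce that target, here is a small sketch of an exponentially weighted average of current-and-future q with a given halflife, followed by the quoted 0.8/0.2 blend (function name and perspective handling are assumptions):

```python
import numpy as np

def future_q_ewma(q, halflife=12.0):
    """Exponentially weighted average of the current and all future
    per-move q values, with the weight halving every `halflife` moves.
    q is ordered from the first to the last move of the game and is
    assumed to already be expressed from one fixed perspective."""
    d = 0.5 ** (1.0 / halflife)           # per-move decay factor
    out = np.empty(len(q), dtype=float)
    num = 0.0                             # running weighted sum
    den = 0.0                             # running total weight
    for t in range(len(q) - 1, -1, -1):   # walk backwards from the last move
        num = q[t] + d * num
        den = 1.0 + d * den
        out[t] = num / den
    return out

# the quoted milder mix: 0.8 * z + 0.2 * (ewma of future q with halflife 12)
# target = 0.8 * z + 0.2 * future_q_ewma(q, halflife=12.0)
```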

@mooskagh
Member

I think this is implemented?
