Train value head against q at root after search and game result #84
I think for a strategy like that to be sound, you would need a confidence interval, and would have to weight the training depending on the interval's width or something like that. It would probably need some fancy math to decide exactly how much.
Maybe just use a formula that weights the root-q after search by the number of moves until the game was decided, and averages that with the actual game result. A threshold could also be used so that the actual game result is always weighted more heavily than root-q. This of course needs some thought, but it seems like a sensible thing to try...
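One way to sketch this suggestion: weight root-q by the distance to the game's decision, capped so that the game result z always keeps at least half the weight. This is a minimal illustration, not anything from lc0; the function name, the `horizon` cutoff, and the `min_z_weight` floor are all hypothetical parameters.

```python
def decided_distance_target(z, root_q, moves_to_end, horizon=40, min_z_weight=0.5):
    """Blend the game result z with root-q after search.

    Hypothetical scheme from the comment above: the weight on root-q
    grows with the number of moves until the game was decided, but z
    always keeps at least `min_z_weight` of the total weight.
    """
    # Weight on q saturates once we are `horizon` moves from the end.
    q_weight = min(moves_to_end / horizon, 1.0) * (1.0 - min_z_weight)
    return (1.0 - q_weight) * z + q_weight * root_q
```

At the final move this returns z unchanged; far from the end it settles at an even split between z and root-q.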
Since root-Q is just an average over visits, I'm not sold on giving it too much weight. It would probably be better to use the Q of the most-visited node instead, and even then I'd be much happier if the Q came from a more advanced technique than averaging.
I think it makes sense to do something like a Taylor series of all future scores. Maybe, for speed, estimate it with the current score, the score 4 moves later, and the final game result.
@Tilps: Hmm, not sure what you mean by "average of visits". Root-q (after searching for a move to play) is an average of all NN evals and terminal values (-1, 0, +1) of the visited nodes, and should be (and is) a predictor of the game result. But I agree that any kind of weighted average, be it by distance to game end or over all root-q values in a game (jjoshua2's idea), is an approximation. Still, I would advocate first trying something simple before trying to make it theoretically bulletproof. Besides the actual formula for combining q's with game results, there are these main schemes:
The first scheme induces more correlation but raises accuracy, while the second has less correlation but loses accuracy. It would be very interesting to test these schemes, because if anything the value head currently has a lot less variance to train on, and this might be a possible solution.
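For reference, root-q as described above (an average of all NN evals and terminal values over visited nodes) amounts to a running mean maintained during backup. A minimal sketch, assuming a bare `Node` stand-in rather than lc0's actual tree types:

```python
class Node:
    """Minimal stand-in for a search-tree node (not lc0's actual type)."""
    def __init__(self):
        self.n = 0      # visit count
        self.q = 0.0    # running mean of values backed up through this node

def backup(path, value):
    """Walk a leaf-to-root path, updating each node's Q as the running
    mean of all values propagated through it. The sign flips at each
    level because the players alternate. Sketch only, not lc0 code.
    """
    for node in path:
        node.n += 1
        node.q += (value - node.q) / node.n  # incremental mean update
        value = -value
```

Root-q is then simply `root.q` after the search finishes, which is why it averages over every visit rather than following the most-visited line.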
I think for testing, it's better to start off simple. My first thought is to train against (N * z + (L - N) * q) / L, where z is the game result and N and L are the current move number and game length respectively. This is basically the same thing as your (2). EDIT: It's worth noting that testing this would require a new training run, since the current training-data format does not include q-values.
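The proposed target is a direct linear blend; a plain transcription of the formula above (names follow the comment's N, L, z, q, and are illustrative only):

```python
def blended_target(z, q, move_number, game_length):
    """Target (N*z + (L-N)*q) / L from the comment above: the weight on
    the game result z grows linearly with the move number, so the final
    move trains almost purely on z and the first almost purely on
    root-q. Transcription for illustration, not lc0 code.
    """
    n, length = move_number, game_length
    return (n * z + (length - n) * q) / length
```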
Oracle just released this a couple of days ago, interesting read - seems like they tried this idea with success.
That's what convinced me that this may not be a bad idea. In general I'm against putting too much weight on the search results, since the search results depend, of course, on what value and policy are already trained to be. Training policy against value is fine in this manner, IMO, because value is not trained against any parameter that is self-referential (it is trained on the game outcome), so even though policy is self-referential to value, it is, through value, indirectly non-self-referential. Training value against the search-result value does eliminate that non-self-referentiality, which is philosophically a big problem in my opinion. Of course it is absolutely true that the game outcome, while not self-referential, is extremely noisy, so it's good to see that introducing the far-less-noisy search result into training, despite removing non-self-referentiality, smooths the noise enough to improve training. Between this and the fact that it would seriously simplify resign handling, I'm thinking that we should just bite the bullet and include each q in the training data now. @Tilps thoughts on at least this last part?
Completely agree: q should be included; the additional space is negligible. If and how we use it is another matter and should be deliberated carefully. For example, we could put at least 50% of the weight on the actual game result and split the rest between root-q and the actual game result, weighted by distance to the game's end. So one move before the game finishes would be trained 100% on the actual game result, while the first move of the game would be trained 50% on the actual game result and 50% on root-q. The schemes from the Oracle write-up are also worth a look.
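The scheme described here pins the game-result weight at a floor of 50% and shifts the remainder linearly over the game. A sketch, with illustrative names only:

```python
def shifting_target(z, root_q, move_number, game_length):
    """Scheme from the comment above: z always gets at least half the
    weight; the other half shifts linearly from root-q at the game's
    start to z at its final move. Sketch only, not lc0 code.
    """
    progress = move_number / game_length  # 0 near the start, 1 at the end
    z_weight = 0.5 + 0.5 * progress
    return z_weight * z + (1.0 - z_weight) * root_q
```

So the last move is trained purely on z, and the opening move gets an even z/root-q split, matching the endpoints given above.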
I like the shifting-proportion-per-move idea, but only if the initial game-result weight is non-zero, as you suggest. I'd be on board to try that.
Before we put Q in the training data, we should decide what Q to put in there.
Another thought - people thinking of changing the Q-to-Z ratio linearly from game start to game end should probably consider that Z is likely much more effective both at the start and close to the end. The end is obvious enough; at the start, Z gains strength because there is only a small number of starting positions compared to games played, so there are lots of Z samples per position to create a good average. On the other hand, Q's limited search window in the early game probably lowers its value there.
@Tilps Good points. Before addressing the main points, there is one sentence I am confused about: "Before we put Q in the training data, we should decide what Q to put in there." I would think it's natural to augment the data generated from training games with the root-q for every move of every game. I would have thought this gives us full flexibility in how we later combine it with Z for the actual learning, so I would not consider this a major issue. Now to the main points.
(1) If we decide we need a confidence interval, that would be very easy to do, as I have already coded measuring the empirical variance (which should be better than any theoretically derived measure) for each node's q. It's only a few lines of code and one additional number to output. I do not yet have an opinion on whether such a confidence interval is necessary.
(2) The question whether such schemes only learn faster but, as a potential downside, might max out earlier can only be answered empirically. What else is there to do but try it?
(3) Regarding temperature: much of what you said could also be said about Z, as temperature also somewhat decorrelates the game outcome from positions (especially early in the game). But I agree this needs some thought.
(4) Yes, simple averaging is not best, but that's exactly why we are discussing it here.
(5) Double yes to your observation that it should probably not be linear (in moves to the game's conclusion) at very early game positions; this brings some sort of quadratic regression to mind.
Proposition: let's say we add the q (at root) of every search to the training data. Then we could do a regression (with quadratic terms) of Z on the predictors root-q and moves_till_game_ends (and maybe other stuff we come up with, like confidence bounds). This would confirm or disconfirm some of the questions raised and directly imply a function to use for combining q and Z.
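The proposed quadratic regression can be sketched with ordinary least squares: fit Z on root-q and moves-to-game-end with quadratic and interaction terms, and the fitted surface is itself a candidate combining function. The feature set and names here are illustrative assumptions, not anything from lc0's pipeline:

```python
import numpy as np

def fit_blend(z, q, moves_to_end):
    """Quadratic regression of the game result z on root-q and the
    number of moves until the game ends, as proposed above. Returns the
    coefficient vector; the fitted surface predicts E[z | q, moves].
    Sketch only; feature choice is an assumption.
    """
    q = np.asarray(q, dtype=float)
    m = np.asarray(moves_to_end, dtype=float)
    # Design matrix: intercept, linear, interaction, and quadratic terms.
    X = np.column_stack([np.ones_like(q), q, m, q * m, q ** 2, m ** 2])
    coef, *_ = np.linalg.lstsq(X, np.asarray(z, dtype=float), rcond=None)
    return coef
```

Applying `X_new @ coef` to fresh (q, moves) features then yields the regression's estimate of the expected game result, which could serve directly as the value-head target.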
Part of the reason it matters what we put there is that the format is not easily changed, and if we go down the confidence-interval route, there is probably value in storing the confidence interval itself rather than just the interval's centre.
I think we should not rush to add something now. It feels like feature creep, and we don't know the best value to put there. Also, after lc0 is finally released there are other tests we could do, such as removing history, which could also change the training-data format. Once lc0 is released we can use the test pipeline for experiments like these.
FYI, using (Q+z)/2 as value head target was also proposed in LZ (see for example |
I think similar things were tested a few times and they didn't bring improvements. Closing this for now as it's a stale issue. |
I'd reopen. The testing showed improvement. |
Do you refer to the Oracle result again, or to something else?
IIRC jio or someone did a CCRL test that showed good results.
@oscardssmith This would have implied not just a CCRL test but, before that test, modifying the training-data format, generating a sufficient volume of self-play games, and training a few generations of networks. If I understand correctly and this was done by someone for chess (and not just Connect 4), then I completely missed that information.
@Cyanogenoid tried this. He modified the training code and played a lot of self-play games. I remember that it had good results, but I can't find them on Discord. The code is in his repo: https://github.com/Cyanogenoid/lczero-training/tree/q and https://github.com/Cyanogenoid/lc0/tree/q
@Ttl Thanks a lot for the link ! |
@Ttl This is an engine used in the AZ paper, and an AZ-like engine
lightvector (KataGo author):
I think this is implemented? |
Just an idea: why not train the value head not against the game result alone, but against the root-q after search averaged with the game result?
Shouldn't this somewhat alleviate the problems of value-head learning? Or has this already been tested?