Is it possible to train the value output in a supervised (or even self-play) manner as AZ paper explains? #20
Comments
Hi @Zeta36! Yes, getting the network away from draws does seem to be a challenge. First step is to validate that with supervised training we can get to a reasonable strength, and that the search is not making any huge blunders. Then potentially doing some self-play with that network and seeing if it can learn to improve itself.
DeepMind's paper does not say they get rid of draws. Moreover, they say they count as a draw any game lasting more than n moves (with n >= the average length of a chess game), so the bias toward the NN getting stuck in a local minimum that always outputs 0 would be even greater. I don't know, it's all very strange.
I do think the dual head helps; both the probability head and the value head are backpropagated. What happens when you remove the drawn games?
But the dual head ends the value part with an independent FC layer that can easily be backpropped to weights near zero, so the head would always return 0 while barely affecting the rest of the shared network. I'm not so sure the dual head can solve the problem.
The network does appear to (slowly) be getting better from the supervised learning. This happened after I dropped the learning rate from 0.05 to 0.005. Latest results:
@glinscott can you examine the FC layer on the value head in the latest weights.txt? Show some statistics? I agree with @Zeta36, it would indeed be problematic if they converge to 0. I'll see if I can write out tf.example bytes and learn from that. Good idea.
@glinscott you have to focus on the MSE loss (the value head part). If I'm correct, the model (the head part) should converge fast (to around 0.66, with near 33% accuracy) and get stuck returning values always very near zero.
@Zeta36 when playing games, the network is definitely returning values not close to zero. E.g. in the following position, we can see the network is learning that white is winning. Unfortunately, the policy has it playing Ne5, which is a repetition...
@glinscott do you have a summary of your current value head loss (and accuracy)?
Here are the latest training steps:
The MSE loss is divided by 4 to match Google's results, though.
hmmmm... well, maybe I'm wrong, but I don't know why. It should get stuck; it's simple statistics. You are using supervised data, aren't you? Do you know if the PGN you are using is biased in some direction (with few or no drawn games or something like that)? Also, how many positions are you using for the optimization? If you are using few (or biased) positions, the NN could find a way to escape the convergence to 0.
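For anyone puzzled by that factor of 4: assuming the difference is purely the value scale (DeepMind reporting MSE with outcomes on [0, 1] while this project uses [-1, 1]), rescaling shrinks every squared error by exactly 4. A tiny sketch with hypothetical numbers:

```python
# Sketch: why dividing the [-1, 1] MSE by 4 matches a [0, 1] scale.
# Illustrative only; the values below are not from the project.

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Value head outputs and game results on the [-1, 1] scale.
preds_wide = [0.2, -0.5, 0.9]
targets_wide = [1, -1, 0]

# Map x -> (x + 1) / 2 onto [0, 1]; every difference halves,
# so every squared error shrinks by a factor of 4.
preds_narrow = [(p + 1) / 2 for p in preds_wide]
targets_narrow = [(t + 1) / 2 for t in targets_wide]

assert abs(mse(preds_wide, targets_wide) / 4
           - mse(preds_narrow, targets_narrow)) < 1e-12
```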
The pgn is from @gcp, who used SF self-play games (https://sjeng.org/dl/sftrain_clean.pgn.xz). Stats indicate it's a normal sample of high-level chess games.
Mmmm, perhaps it's a normal sample of high-level chess games, but there is a statistical bias in those results: too many white wins versus black. That could let the model escape the fast convergence to zero. I don't want to bother you too much, but it would be great to check with 33% white wins, 33% draws and 33% black wins. Maybe training with the same number of 0-1 and 1-0 results would be enough to rule out my hypothesis. And this could be important, because with a random initial self-play model there will probably be equal numbers of white and black wins in the beginning (which, if I'm correct, could cause a fast and bad convergence of the value head).
@Zeta36 agreed, I do want to implement that. But the network at least appears to be learning well with these games so far: And the latest results of the test match against the random network:
Nice results! Did you have a chance to play against the model yourself to see if it may be overfitting to "high-level" games? I believe I had a problem early on where all my training data had little material variance, so the model had no chance to learn material imbalances (and the idea behind introducing variance/Dirichlet noise to the self-play generator could partly be to "remind" the network what bad positions look like, improving training stability). Another idea to eliminate the white/black bias that @Zeta36 brought up is to flip the board (isomorphically transforming the space of "black-to-move" positions into "white-to-move" positions). I think this also introduces extra regularization at no cost.
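The board-flip idea can be sketched like this, assuming a toy (12, 8, 8) plane encoding with planes 0-5 for white piece types and 6-11 for black; the real leela-chess input format may differ:

```python
import numpy as np

# Sketch of the color-flip trick on an ASSUMED toy input encoding:
# shape (12, 8, 8), planes 0-5 = white pieces, 6-11 = black pieces.
# This is not necessarily the actual leela-chess plane layout.

def flip_to_white_to_move(planes):
    """Mirror the board vertically and swap white/black planes,
    turning a black-to-move position into the equivalent
    white-to-move one."""
    mirrored = planes[:, ::-1, :]                        # flip ranks
    return np.concatenate([mirrored[6:], mirrored[:6]])  # swap colors

pos = np.zeros((12, 8, 8))
pos[6, 7, 4] = 1  # "black king" plane, rank index 7 (e8)
flipped = flip_to_white_to_move(pos)
assert flipped[0, 0, 4] == 1  # now the "white king" plane, e1
```

Applying this transform to every black-to-move position would make the white/black win-rate bias in the data irrelevant to the value head, since every training target is expressed from the side to move.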
It's definitely still making a ton of mistakes. Here is an example game with the new network as white, random weights as black:
You can see that after 12. Bd3 Ke7 the queen is hanging, but the network doesn't see it. I suspect setting the network to predict just the move SF played is hurting here. When training in self-play, it learns to predict the probabilities of all the moves visited by UCT, which seems much more robust. Still this seems like a solid baseline. Good proof the code is mostly working too :). I've uploaded these weights to https://github.com/glinscott/lczero-weights/blob/master/best_supervised_5_64.txt.gz if others are interested. The match score against the random network was:
(amusingly it lost the very last game)
Curiously, the game seems to play more or less like @Akababa's best results. Did your model already converge, @glinscott? Maybe we are facing some kind of limitation in the model we are using (following AZ0).
@Zeta36 I don't think it's converged yet, but I've had to drop the learning rate twice so far. So probably getting close:
I see. And don't you think it's a little strange that with such good convergence (in both the policy and value heads) the model still makes so many (and clear) errors? @Akababa tried many times (also with very good convergence and with very rich and big PGN files), but the model was never able to get rid of all blunders (nor reach any long-range strategy). Maybe there is something profound related to the model itself that makes it unable to learn well (something DeepMind didn't explain in its last paper, or something like that).
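The one-hot versus visit-count distinction mentioned above can be made concrete with a toy cross-entropy calculation (made-up probabilities, not actual network output):

```python
import math

# Sketch of the policy-target difference discussed here: a one-hot
# "move SF played" target versus the softer distribution of MCTS
# visit counts. All numbers are invented for illustration.

def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Network's current policy over 4 candidate moves.
policy = [0.40, 0.30, 0.20, 0.10]

one_hot = [1.0, 0.0, 0.0, 0.0]           # only "the" SF move is right
visit_counts = [0.50, 0.35, 0.10, 0.05]  # normalized UCT visits

# The one-hot target punishes every move except one; the visit-count
# target also rewards probability mass on reasonable alternatives.
loss_one_hot = cross_entropy(one_hot, policy)
loss_visits = cross_entropy(visit_counts, policy)
assert abs(loss_one_hot + math.log(policy[0])) < 1e-12
```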
I actually think it's very reasonable for the model to blunder when trained on one-hot Stockfish moves in very balanced positions. I'd say this is a limitation of supervised training in this fashion, and as @glinscott pointed out, training on MCTS visit counts would be more robust because it's a good local policy improvement operator. But iirc leelazero reached a decent level by this same method, so it could simply be a lack of MCTS playouts, or chess may require more layers. Also, without a validation split it's impossible to measure overfitting.
With only 80k games there is very likely to be vast overfitting in the value layer. You can control for this by lowering the MSE weighting in the total loss formula, or by having more games (I'm still generating a ton), which is far better. From this discussion I don't see why you think the network will converge to always returning 0.5 (or 0 in the -1...1 range), though. It will reach that point quickly, but it will also be able to see that when one side has much more material (which is a trivial function of the inputs), the losses drop heavily when predicting a win for that side. And so on. It is harder/slower to train on imbalanced categories, but it certainly isn't impossible.
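The material argument can be checked numerically: on synthetic (material balance, result) pairs, even a crude material-based predictor beats the constant-0 one. The samples and the scaling below are invented for illustration:

```python
# Toy check of the claim: a crude material-based predictor lowers
# the MSE below the constant-0 predictor on non-drawn games.
# Synthetic (material_balance_in_pawns, game_result) samples only.

samples = [(5, 1), (3, 1), (0, 0), (0, 0), (-4, -1), (-2, -1)]

def mse(predict):
    return sum((predict(m) - z) ** 2 for m, z in samples) / len(samples)

constant_zero = lambda m: 0.0
# Clamp a crudely scaled material balance into [-1, 1].
material_sign = lambda m: max(-1.0, min(1.0, m / 3.0))

assert mse(material_sign) < mse(constant_zero)  # material helps
```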
It also had a ton more data! About 2.5M games times 8 rotations (it's not possible to use rotations in chess).
@gcp, but once the model converges quickly (and deeply) into ignoring the input and always outputting 0, the weights of the last FC layer will be so small that any backpropagated gradient would be almost negligible, wouldn't it? Go does not have this (theoretical) problem since it has no drawn games. About the number of games: @Akababa tried with a huge dataset of really big PGN files here: https://github.com/Zeta36/chess-alpha-zero, and even though he got good convergence of the MSE and policy losses, the model could only play more or less "good" games; he was not able to remove all blunders, nor to get a model able to play strategically over the long range. What do you think?
But why would it converge that way? Not all games are drawn. The mispredictions on the won or lost games will still cause big gradients. The predictions of 0 on the drawn games will cause no gradient. Predicting 0.1 on a game that was drawn will produce a tiny gradient compared to mispredicting a win as a draw. Still plenty of room to make the distinctions, as there is strong pressure (and actual gradient direction) on the network to correctly predict the 40% of games that are not drawn.
Predicting 0 is still an easy way to get a quick drop in MSE loss compared to predicting some other constant value, or predicting randomly. So I'm not sure this argument even works!
This page talks about 1,000 games or 3,000 games. You want 2 orders of magnitude more, or you will get MSE overfitting, as I already pointed out.
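The gradient argument above can be made explicit: for a squared error (v - z)^2 the per-sample gradient is 2(v - z), so at v = 0 draws contribute nothing while decisive games keep pushing the weights. A minimal sketch:

```python
# Per-sample gradient of the squared error (v - z)^2 with respect
# to the prediction v: d/dv (v - z)^2 = 2 * (v - z).

def grad(v, z):
    return 2 * (v - z)

assert grad(0.0, 0) == 0.0    # drawn game: no gradient at v = 0
assert grad(0.0, 1) == -2.0   # won game: strong pull toward +1
assert grad(0.0, -1) == 2.0   # lost game: strong pull toward -1

# Even slightly off 0, the draw gradient stays tiny compared to
# the gradient from a mispredicted decisive game.
assert abs(grad(0.1, 0)) < abs(grad(0.1, 1))
```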
@Akababa tried with a lot more than 3,000 games (even though in the readme he only talks about 1,000 games). About the quick convergence: imagine you have a worker playing self-play games (or a parser of PGN files). You get a chunk of, say, 15,000 positions. On average you will have roughly 5,000 positions with z=-1, 5,000 with z=0, and 5,000 with z=1. Then you run the optimization worker and it reads the 15,000 positions to backprop. The loss function is MSE, so there is a clear and easy (fast) way to reach a local minimum where the model ignores the input (the board) and always outputs 0 (the mean), simply by driving the weights of the last FC layer to values very near 0. In this case, the optimization will quickly reach an accuracy of 1/3 (33%) and a low MSE loss of 0.66. After very few steps the model would get stuck with this function, and no further improvement could take it out of this fast (and deep) convergence (because the weights of the last FC layer will be so small that any backpropagated gradient would be almost negligible). I'm just talking theoretically, assuming the dataset has no bias but an almost perfect proportion of wins, losses and draws (-1, 0, 1), and that we use a really big dataset that can't be overfitted.
I just don't see why this would happen, for the reasons already stated. And unless there were bugs, doesn't the practical result from @glinscott show that it does not? With 40% of the games producing a strong gradient towards anything that remotely correlates with the population count of many input planes [1], and all the drawn games producing an ever-tinier gradient towards "always 0", how could you get stuck deeply in a local minimum? It just sounds so weird. [1] This is why my dataset has no resignations, FWIW.
@glinscott's results are based on a very biased dataset (with double the z=1 results relative to z=-1) and with a small, easy-to-overfit number of positions. It would be great to check this with a much bigger number of positions and an unbiased dataset (with nearly equal numbers of z=-1, 0 and 1).
@gcp, I don't know why you are so "aggressive" against the possibility that we have an issue with the value head. I think the facts are there and all point in that direction: look for example at @Error323's MSE plots (scaled up by 4), and look how the games above seem to follow the policy head, but as soon as the policy fails (after the first opening moves, easy for the policy to memorize) the game turns chaotic. With even a minimally "good" value head convergence, the game should follow a much smoother strategy throughout.
I'm objecting to the idea that if you have a large dataset with 70% class A, it's impossible to build a classifier because it will always return class A. I don't disagree that it's hard to train the value head, but the constant implication that it is not possible (which flies in the face of established evidence, unless you believe DeepMind made things up) I just do not understand. And there is data supporting that: if it had fatally converged to 0, the network would not be returning 68% in some positions.
@Error323 What is the current preferred training method after your changes? This is what I would do now, judging from the code (before I used
I'm not going to fight you, @gcp. But I think we should at least study the possibility that the value head is not learning because of some problem (or incapacity). And I don't say it's impossible to train the value head, but that maybe it's not possible in the way we have been trying so far. And the facts are out there (the MSE loss and game quality are there even if you don't want to see them).
From @kiudee's results, the discussion here and my own experience thus far, I think we need to train for more steps and change the learning-rate decay function. Ideally we would alter the decay function such that the lr decreases only when both the policy and value heads have converged, as @gcp suggested in another thread. From @kiudee's graphs and mine you can clearly see that the MSE and/or policy loss has not reached convergence during some of the lr steps, so we decrease the learning rates too soon, which in turn results in slower learning --> significantly more steps. Regarding the training method, indeed steps 1, 2 and 3 are correct, @kiudee. This speeds up the training process for me by a factor of 2. The conversion of the planes to binary proto is very CPU-intensive, much more so than in the Go version, as we have more planes to consume per position.
If you read my post you'll see I don't disagree that the results aren't good; I'm objecting to the implication that it's impossible. Another idea: train with a larger regularizer (large enough to control MSE over-fitting), but drop the weighting of the policy network. See if you get better MSE performance than we have now. After that has happened, make the weights equal and go back to a normal regularizer. I will fiddle with some of these.
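A minimal sketch of the plateau-based schedule being discussed, where the lr drops only after the combined loss stops improving for a few evaluations (class name, thresholds and loss values are all illustrative, not from the leela-chess trainer):

```python
# Sketch: drop the learning rate only once BOTH the policy and
# value losses have stopped improving, instead of on a fixed
# step schedule. Hypothetical names and thresholds throughout.

class PlateauDecay:
    def __init__(self, lr=0.05, factor=0.1, patience=3, min_delta=1e-3):
        self.lr, self.factor = lr, factor
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.stale = 0

    def step(self, policy_loss, value_loss):
        total = policy_loss + value_loss
        if total < self.best - self.min_delta:
            self.best, self.stale = total, 0   # still improving
        else:
            self.stale += 1                    # no real improvement
            if self.stale >= self.patience:
                self.lr *= self.factor         # decay and reset
                self.stale = 0
        return self.lr

sched = PlateauDecay()
for losses in [(2.0, 0.6), (1.5, 0.5), (1.5, 0.5), (1.5, 0.5), (1.5, 0.5)]:
    lr = sched.step(*losses)
assert lr == 0.05 * 0.1  # decayed once after 3 stale evaluations
```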
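The two-phase experiment proposed above could look roughly like this (the weight values are placeholders, not tuned settings from the project):

```python
# Sketch of the proposed experiment: a total loss with an adjustable
# policy weight and L2 regularization, so the value (MSE) term can
# be emphasized first and the weights rebalanced later.
# All numeric values below are invented placeholders.

def total_loss(policy_ce, value_mse, l2_penalty,
               policy_weight=1.0, reg_weight=1e-4):
    return policy_weight * policy_ce + value_mse + reg_weight * l2_penalty

# Phase 1: down-weight the policy head, larger regularizer.
phase1 = total_loss(2.3, 0.66, 120.0, policy_weight=0.25, reg_weight=1e-3)
# Phase 2: equal weighting, normal regularizer.
phase2 = total_loss(2.3, 0.66, 120.0, policy_weight=1.0, reg_weight=1e-4)

# With the same raw losses, phase 1 lets the optimizer focus on
# the value term (the MSE contributes a larger share of the total).
assert phase1 < phase2
```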
I never said it's impossible, just that maybe it's impossible in the way we are currently trying. So I guess we more or less agree.
Good idea. And there are a lot more things we can try before going as crazy as generating more data. By the way, this is the kind of "trick" I think DeepMind could be "hiding" in their last (and short) paper about AZ.
DeepMind didn't talk about doing SL in the last paper at all. We're doing it because it's interesting and useful for debugging.
DeepMind has always been in the habit of publishing what worked for them, not what didn't (and they're not exactly the only ones in academia doing that). So there's no alternative for our projects here but to try multiple things and find out what works and what doesn't, even if that results in occasional setbacks.
Indeed, it's very likely they didn't need to, because the whole purpose of that paper was to show the generic power of the algorithm. They only had to adapt the move generation and NN I/O. I think that before we start self-play we should get a really solid understanding of the architecture through experimentation and good results in SL mode, that is, similar to the SL results in Go.
Of course I know that, but SL has to be able to train the model if a self-play pipeline can. Since we are not able to train the model correctly using SL (and huge amounts of data), self-play will not do much better. Supposing that just more (self-play) data will fix this issue with the value head is very naive.
I'm not so sure about that. Even though Leela Zero obviously trains well with self-play data, I didn't manage to get a good SL-trained network either without messing with the MSE values. And the same was true in the DeepMind paper. The RL data is more diverse and more specifically exposes the holes in the current network's knowledge, in a way SL data does not. Consider this: what do you think the draw rate is for the SL data versus the RL data?
I'm not sure about your position, @gcp. I doubt a model that fully fails in a supervised manner will do much better training on self-play data. Anyway, if you prefer working in the brute-force manner, playing millions of (self-play) games and looking to see what happens instead of studying the issue theoretically, well, it's an (expensive) option. We'll see the results in a few months ;).
@Zeta36 The whole DeepMind approach of AlphaGo Zero and AlphaZero is about using self-play reinforcement learning, presumably because they got much better results from doing so than using SL only. Sure, it costs resources, but it is really only self-play with MCTS and exploration from Dirichlet noise and temperature that will fix any knowledge holes in the training data set. At some point leela-chess will also have to start a self-play pipeline to figure out what really works in the chess context. Studying the issue theoretically is a good starting point but can only get you so far.
Of course, self-play is the final goal. I just commented that trying self-play training with a model that fully fails in an SL setting is not a good idea.
I think "fully fails" is an unfair characterisation of the RL results by @glinscott, @Error323, @kiudee and also @Akababa and yourself in the chess-alpha-zero context. There are already multiple networks that play at a reasonable human level, even if not close to the level of Stockfish. I doubt that would happen with the network size in use here anyway, but for the moment the point is to prove the setup works and will result in a network that can significantly outperform an SL net of the same dimensions. Comparing it to Stockfish or other top engines is just not realistic yet.
Fully fails in the sense that it is unable to learn any chess concept or strategy and only seems to follow the well-converged policy head. In this sense, all the MSE plots shown until now have a very strange and suspicious convergence curve (always falling quickly to around ~0.6 once scaled up by 4). My personal opinion is just that we should wait for the model to reach a realistic value-head convergence in SL, showing us the network has learned at least a minimal value strategy. But as I say, it's just my opinion; if you want to start creating millions of self-play games already, that's fine. We'll see in a few months whether we get something or not.
I agree that without a value head that has at least some sensible output you can't even really be sure the thing is properly debugged. For all you know your tree search has black and white scores reversed or something, and you'd never notice. I'm just saying that RL data has some better properties that make it easier to work with.
I really don't understand all the fuss about the value head. Why can't we just remove the color plane from the input?
@gcp I was processing your latest stockfish games and got an invalid game:
Scid also only reads 223k games.
They're from pgn-extract, so that would be strange. I uploaded a new one this morning; this loads fine for me in Scid (vs PC).
I've been looking at the training, and it looks like the policy and value heads are from the AlphaGo Zero paper, not the AlphaZero paper? That would affect things! I'll file a separate issue.
Just to make sure that there were no network errors, I get the following checksum:
edit: Found the problem. The download using
The problem was Firefox - it was (repeatedly) downloading an incomplete file. Any idea why?
Initial download failed and the partially downloaded file was not deleted? (And, I think, it then tried to resume after I had updated the file.) If you check your downloads dir you might find a .part file. Downloading with Firefox works fine here, FWIW.
@gcp what do you mean? I don't recall the AlphaZero paper talking about the NN architecture.
@Error323 They describe the policy output head. It needs to be quite different from Go's, because Go moves only have "to" squares, which map nicely onto a single board, but not so for chess. See issue #47 for more details. It looks like the policy output from Leela Zero was copied here without making the required change.
@Error323 has proven that it is indeed possible :). Congrats!
I've been working for a while on this kind of AZ project. I even started this project some months ago: https://github.com/Zeta36/chess-alpha-zero
And I have a doubt I'd like to ask you about: the training of the value output. If I backprop the value output always with an integer (-1, 0, or 1), the NN should quickly get stuck in a local minimum, ignoring the input and always returning the mean of these 3 values (in this case 0). I mean, as soon as the NN learns to always return values near 0, ignoring the input planes, there will be no more improvement, since it will have a high accuracy (>25%) almost immediately after some steps.
In fact I did a toy experiment to confirm this. As I mentioned, the NN was unable to improve after reaching 33% accuracy (~0.65 mean-squared loss). And this makes sense if the NN is always returning 0 (values very near zero). Imagine we introduce a dataset of 150 games: ~50 are -1, ~50 are 0 and ~50 are 1. If the NN learns to always output near 0, we get an instant MSE loss of 100/150 ~ 0.66 and an accuracy of ~33% (1/3).
How the hell did DeepMind manage to train the value network with just 3 integer values to backpropagate??
I thought the tournament selection (the evaluation worker) was involved in helping to overcome this local minimum (stabilizing the training), but in their last paper they say they removed the eval process (??)... so I don't really know what to think.
I don't know either whether self-play can help with this issue. Ultimately we are still backpropagating an integer from a domain of just 3 values.
Btw, you can see in our project at https://github.com/Zeta36/chess-alpha-zero that we got some "good" results (in a supervised way), but I suspect it was all thanks to the policy network guiding the MCTS exploration (with a value function always returning values near 0).
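The arithmetic of this toy experiment can be verified in a couple of lines:

```python
# Quick check of the numbers above: a predictor that always outputs
# 0 on a balanced set of results {-1, 0, 1} gets MSE = 2/3 (~0.66)
# and accuracy = 1/3 (~33%).

results = [-1] * 50 + [0] * 50 + [1] * 50  # the 150-game dataset

mse = sum((0 - z) ** 2 for z in results) / len(results)
accuracy = sum(1 for z in results if z == 0) / len(results)

assert abs(mse - 2 / 3) < 1e-12       # 100 / 150
assert abs(accuracy - 1 / 3) < 1e-12  # 50 / 150
```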
What do you think about this?