
Is it possible to train the value output in a supervised (or even self-play) manner as AZ paper explains? #20

Closed
Zeta36 opened this issue Jan 14, 2018 · 153 comments


@Zeta36 commented Jan 14, 2018

I've been working for a while on this kind of AZ project. I even started this project some months ago: https://github.com/Zeta36/chess-alpha-zero

And I have a doubt I'd like to ask you about. It concerns the training of the value output. If I always backprop the value output with an integer (-1, 0, or 1), the NN should quickly get stuck in a local minimum, ignoring the input and always returning the mean of these 3 values (in this case 0). I mean, as soon as the NN learns to always return near-0 values while ignoring the input planes, there will be no more improvement, since it will reach a high accuracy (>25%) almost immediately after a few steps.

In fact I did a toy experiment to confirm this. As I mentioned, the NN was unable to improve after reaching 33% accuracy (~0.65 mean-squared loss). And this makes sense if the NN is always returning 0 (values very near zero). Imagine we feed in a dataset of 150 games: ~50 are -1, ~50 are 0 and ~50 are 1. If the NN learns to always say near 0, we get an instant loss (MSE) of 100/150 ≈ 0.66 and an accuracy of ~33% (1/3).
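For reference, the arithmetic above can be checked with a few lines of numpy; this is just an illustration, not project code:

import numpy as np

z = np.array([-1.0] * 50 + [0.0] * 50 + [1.0] * 50)  # 150 game outcomes, evenly split
v = np.zeros_like(z)                                  # a network that always predicts 0

print(np.mean((v - z) ** 2))   # MSE = (50*1 + 50*0 + 50*1) / 150 = 0.666...
print(np.mean(v == z))         # "accuracy": only the 50 draws match = 0.333...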

How the hell did DeepMind manage to train the value network with just 3 integer values to backpropagate??
I thought the tournament selection (the evaluation worker) was involved in helping to overcome this local minimum (stabilizing the training), but in their latest paper they say they removed the eval process (??)... so I don't really know what to think.

I don't know either whether self-play can help with this issue. In the end we are still backpropagating an integer from a domain of just 3 values.

Btw, you can see in our project at https://github.com/Zeta36/chess-alpha-zero that we got some "good" results (in a supervised way), but I suspect it was all thanks to the policy network guiding the MCTS exploration (with a value function always returning near-0 values).

What do you think about this?

@glinscott (Owner)

Hi @Zeta36! Yes, getting the network away from draws does seem to be a challenge. The first step is to validate that with supervised training we can get to a reasonable strength and that the search is not making any huge blunders. Then we can potentially do some self-play with that network and see if it can learn to improve itself.

@Zeta36 (Author) commented Jan 14, 2018

DeepMind's paper does not say they got rid of the draws. Moreover, they say they count as a draw any game taking more than n moves (with n >= the average number of moves in a chess game), so the bias toward the NN getting stuck in a local minimum always saying 0 would be even greater.

I don't know, it's all very strange.

@Error323 (Collaborator)

I do think the dual head helps; both the policy head and the value head are backpropagated. What happens when you remove the draw games?

@Zeta36 (Author) commented Jan 14, 2018

But the dual head ends the value part with an independent FC layer that can easily be backpropagated to have weights near zero, so that the head always returns 0 while barely affecting the rest of the shared network. I'm not so sure the dual head can solve the problem.
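To illustrate the point about the final FC layer, here is a toy numpy sketch of a value head of the assumed shape (flattened trunk features, then FC, then tanh). It is not the project's code; it just shows that once the final weights are near zero the head outputs ~0 regardless of the position:

import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 256))     # hypothetical flattened trunk features for 8 positions
w_fc = 1e-6 * rng.normal(size=(256, 1))  # final FC weights driven toward zero
b_fc = np.zeros(1)

value = np.tanh(features @ w_fc + b_fc)  # essentially 0 for every input
print(value.ravel())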

@glinscott (Owner)

The network does appear to (slowly) be getting better from the supervised learning. This happened after I dropped the learning rate from 0.05 to 0.005.

Latest results:

step 576000, policy loss=2.67612 mse=0.0616492 reg=0.400642 (3618.78 pos/s)
step 576000, training accuracy=28.0371%, mse=0.0602521

@Error323 (Collaborator) commented Jan 14, 2018

@glinscott can you examine the FC layer on the value head in the latest weights.txt? Show some statistics? I agree with @Zeta36 it would indeed be problematic if they converge to 0.

I'll see if I can write out tf.example bytes and learn from that. Good idea.

@Zeta36 (Author) commented Jan 14, 2018

@glinscott you have to focus on the MSE loss (the value head part). If I'm correct, the value head should converge fast (to around 0.66, with nearly 33% accuracy) and get stuck always returning (very near) zero values.

@glinscott (Owner)

@Zeta36 when playing games, the network is definitely returning values not close to zero. E.g. in the following position, we can see the network has learned that white is winning. Unfortunately, the policy has it playing Ne5, which is a repetition...

1r1n4/1p1P1b2/2p1p1nk/PPP3rp/3PPP2/3N3P/1R2N1B1/3Q1RK1 w - - 7 45
eval: 0.892773
Ne5 0.549349
Rb4 0.069966
fxg5+ 0.046496
Qa4 0.042237
Qc2 0.037499
Nb4 0.031261
Rb3 0.025240

@Zeta36 (Author) commented Jan 14, 2018

@glinscott do you have a summary of your current value head loss (and accuracy)?

@glinscott (Owner)

Here are the latest training steps:

step 580100, policy loss=2.64198 mse=0.0594456 reg=0.395409 (2585.08 pos/s)
step 580200, policy loss=2.64212 mse=0.0593968 reg=0.395284 (2826.09 pos/s)
step 580300, policy loss=2.64205 mse=0.0591733 reg=0.395159 (3744.2 pos/s)
step 580400, policy loss=2.64157 mse=0.0593887 reg=0.395034 (3165.17 pos/s)
step 580500, policy loss=2.64044 mse=0.0594478 reg=0.394909 (3733.18 pos/s)
step 580600, policy loss=2.63911 mse=0.0594092 reg=0.394784 (3285.71 pos/s)
step 580700, policy loss=2.63862 mse=0.0591817 reg=0.39466 (3275.92 pos/s)
step 580800, policy loss=2.63818 mse=0.0592512 reg=0.394536 (3754.25 pos/s)
step 580900, policy loss=2.6374 mse=0.0594759 reg=0.394412 (3215.41 pos/s)
step 581000, policy loss=2.63621 mse=0.0593435 reg=0.394289 (3251.14 pos/s)
step 581100, policy loss=2.63555 mse=0.0594918 reg=0.394166 (3780.63 pos/s)
step 581200, policy loss=2.6349 mse=0.0597201 reg=0.394043 (3297.47 pos/s)
step 581300, policy loss=2.63339 mse=0.0596737 reg=0.393921 (3817.32 pos/s)
step 581400, policy loss=2.6337 mse=0.0595412 reg=0.393798 (3266.03 pos/s)
step 581500, policy loss=2.6338 mse=0.0591958 reg=0.393676 (3276.31 pos/s)
step 581600, policy loss=2.63308 mse=0.0587861 reg=0.393554 (3790.01 pos/s)
step 581700, policy loss=2.63232 mse=0.0587501 reg=0.393433 (3224.15 pos/s)
step 581800, policy loss=2.6325 mse=0.0587086 reg=0.393312 (3715.49 pos/s)
step 581900, policy loss=2.6318 mse=0.0589177 reg=0.393191 (3213 pos/s)
step 582000, policy loss=2.6317 mse=0.0587718 reg=0.39307 (3260.37 pos/s)
step 582000, training accuracy=29.0234%, mse=0.0632076

The MSE loss is divided by 4 to match Google's results though.
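A note on the divide-by-4, assuming it is there because the AlphaGo Zero paper reports value MSE on outcomes rescaled from [-1, 1] to [0, 1]: the factor follows directly from the rescaling, as this small check shows (the numbers are made up).

import numpy as np

z = np.array([-1.0, 0.0, 1.0, 1.0])   # outcomes on the [-1, 1] scale
v = np.array([-0.2, 0.1, 0.6, 0.9])   # hypothetical value-head outputs

mse_raw = np.mean((v - z) ** 2)
mse_01 = np.mean(((v + 1) / 2 - (z + 1) / 2) ** 2)
print(mse_raw / 4, mse_01)             # identical: halving the error scale divides the squared error by 4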

@Zeta36 (Author) commented Jan 14, 2018

Hmmm... well, maybe I'm wrong, but I don't see why. It should get stuck; it's simple statistics. You are using supervised data, aren't you? Do you know if the PGN you are using is biased in some direction (with few or no drawn games, or something like that)? Also, how many positions are you using for the optimization? If you are using few (or biased) positions, the NN could find a way to escape the convergence to 0.

@glinscott (Owner)

The pgn is from @gcp, who used SF self-play games (https://sjeng.org/dl/sftrain_clean.pgn.xz). The stats indicate it's a normal sample of high-level chess games:

   6264 [Result "0-1"]
  15382 [Result "1-0"]
  53322 [Result "1/2-1/2"]
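Counts like these can be reproduced from any PGN with a small script along these lines (result_counts is a hypothetical helper, not part of the repo):

import re
from collections import Counter

def result_counts(pgn_path):
    counts = Counter()
    with open(pgn_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            m = re.match(r'\[Result "([^"]*)"\]', line)
            if m:
                counts[m.group(1)] += 1
    return counts

# e.g. result_counts("sftrain_clean.pgn") -> Counter({'1/2-1/2': ..., '1-0': ..., '0-1': ...})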

@Zeta36 (Author) commented Jan 14, 2018

Mmmm, perhaps it's a normal sample of high-level chess games, but there is a statistical bias in those results: too many white wins versus black wins. That could let the model escape the fast convergence to zero.

I don't want to bother you too much, but it would be great to check with ~33% white wins, ~33% draws and ~33% black wins. Maybe training with the same number of 0-1 and 1-0 results would be enough to rule out my hypothesis.

And this could be important, because with a randomly initialized self-play model there will probably be about the same number of white and black wins in the beginning (which, if I'm correct, could cause a fast and bad convergence of the value head).

@glinscott (Owner)

@Zeta36 agreed, I do want to implement that. But the network at least appears to be learning well with these games so far:

[training loss graph]

And the latest results of the test match against the random network:
Score of lc_new vs lc_base: 24 - 1 - 8 [0.848] 33

@Akababa (Contributor) commented Jan 15, 2018

Nice results! Did you have a chance to play against the model yourself to see if it may be overfitting to "high-level" games? I believe I had a problem early on where all my training data had little material variance, so the model had no chance to learn material imbalances (and the idea behind introducing variance/Dirichlet noise into the self-play generator could partly be to "remind" the network what bad positions look like, improving training stability).

Another idea to eliminate the white/black bias that @Zeta36 brought up is to flip the board (an isomorphic transform of the space of "black-to-move" positions into "white-to-move" positions). I think this also adds extra regularization at no cost.
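A rough sketch of the flip idea, under assumed conventions that are not necessarily the repo's encoding (planes[:6] hold white's pieces, planes[6:12] black's, and z is the result from white's point of view):

import numpy as np

def flip_black_to_move(planes, z):
    """Map a black-to-move position onto the equivalent white-to-move one."""
    mirrored = planes[:, ::-1, :]                           # mirror the ranks (planes has shape (12, 8, 8))
    swapped = np.concatenate([mirrored[6:], mirrored[:6]])  # swap the piece colours
    return swapped, -z                                      # the result flips with the colours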

@glinscott (Owner)

It's definitely still making a ton of mistakes. Here is an example game with the new network as white, random weights as black:

1. d4 e6 2. c4 d6 3. Nc3 {0.52s} Bd7 {0.51s} 4. e4 {0.54s} Qe7 5. Nf3 {0.50s}
Nh6 6. Be2 f6 7. O-O f5 8. exf5 g5 9. fxg6 Bb5 10. cxb5 Qf6 11. Re1 Qxg6 12. Bd3
Ke7 {0.71s} 13. Bf4 {0.50s} Kf6 14. a4 Be7 {0.50s} 15. a5 Rf8 16. h3 Qh5
17. Bc4 {0.53s} Qc5 18. dxc5 {0.52s} Kf5 19. Qd2 {0.52s} Kg6 20. Bxe6 {0.52s}
Kh5 21. cxd6 {0.50s} Bf6 22. Nd5 {0.51s} Bc3 {0.50s} 23. Qxc3 Rh8 24. Qe5+ Nf5
25. Bf7# {White mates} 1-0

You can see that after 12. Bd3 Ke7 the queen is hanging, but the network doesn't see it. I suspect setting the network to predict just the move SF played is hurting here. When training on self-play, it learns to predict the probabilities of all the moves visited by UCT, which seems much more robust. Still, this seems like a solid baseline. Good proof the code is mostly working too :).
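For contrast with the one-hot SF target, a sketch of the self-play policy target: the temperature-scaled MCTS visit counts, normalized into a distribution (illustrative only):

import numpy as np

def visits_to_policy(visit_counts, temperature=1.0):
    counts = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / temperature)
    return counts / counts.sum()

print(visits_to_policy([120, 40, 20, 20]))   # -> [0.6, 0.2, 0.1, 0.1]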

I've uploaded these weights to https://github.com/glinscott/lczero-weights/blob/master/best_supervised_5_64.txt.gz if others are interested.

The match score against the random network was:

Score of lc_new vs lc_base: 95 - 1 - 4  [0.970] 100
ELO difference: 604

(amusingly it lost the very last game)

@Zeta36 (Author) commented Jan 15, 2018

Curiously, the game seems to play more or less like @Akababa's best results. Has your model already converged, @glinscott? Maybe we are facing some kind of limitation in the model we are using (following AZ0).

@glinscott (Owner)

@Zeta36 I don't think it's converged yet, but I've had to drop the learning rate twice so far. So it's probably getting close:
[training loss graph]

@Zeta36 (Author) commented Jan 15, 2018

I see. And don't you think it's a little strange that with such good convergence (in both the policy and value head) the model still makes so many (and clear) errors? @Akababa tried lots of times (also with very good convergence and with very rich and big PGN files), but the model was never able to get rid of all the blunders, nor to reach any long-range strategy. Maybe there is something deeper, related to the model itself, that makes it unable to learn well (something DeepMind didn't explain in their latest paper, or something like that).

@Akababa (Contributor) commented Jan 15, 2018

I actually think it's very reasonable for the model to blunder when trained on one-hot Stockfish moves in very balanced positions. I'd say this is a limitation of supervised training in this fashion, and as @glinscott pointed out, training on MCTS visit counts would be more robust because it's a good local policy improvement operator. But iirc leela-zero reached a decent level by this same method, so it could simply be a lack of MCTS playouts, or chess may require more layers.

Also, without a validation split it's impossible to measure overfitting.

@gcp (Contributor) commented Jan 15, 2018

With only 80k games there is very likely to be a vast overfit in the value layer. You can control for this by lowering the MSE weighting in the total loss formula, or by having more games (I'm still generating a ton), which is far better.

From this discussion I don't see why you think the network will converge to always returning 0.5 (or 0 in the -1...1 range), though. It will reach that point quickly, but it will also be able to see that when one side has much more material (which is a trivial function of the inputs), the losses drop heavily when predicting a win for that side. And so on.

It is harder/slower to train on imbalanced categories, but it certainly isn't impossible.
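For reference, the total loss being discussed has the usual three terms; the weights below are illustrative knobs, not the repo's actual values:

import numpy as np

def total_loss(policy_logits, pi_target, value, z, params,
               w_policy=1.0, w_value=1.0, c=1e-4):
    # cross-entropy of the policy head against the target distribution
    log_p = policy_logits - np.log(np.sum(np.exp(policy_logits)))
    policy_loss = -np.sum(pi_target * log_p)
    value_loss = (value - z) ** 2                      # the MSE term whose weighting can be lowered
    reg = c * sum(np.sum(w ** 2) for w in params)      # L2 regularizer
    return w_policy * policy_loss + w_value * value_loss + reg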

@gcp (Contributor) commented Jan 15, 2018

But iirc leelazero reached a decent level by this same method so it could simply be lack of MCTS playouts or chess requires more layers.

It also had a ton more data! About 2.5M games times 8 rotations (not possible to use rotations in chess).

@Zeta36 (Author) commented Jan 15, 2018

@gcp, but once the model converges quickly (and deeply) into ignoring the input and always saying 0, the weights of the last FC layer will be so small that any attempt to backpropagate a gradient would be almost negligible, wouldn't it?

Go does not have this (theoretical) problem since it has no zero-result games (no draws).

About the number of games: @Akababa tried with a huge dataset of really big PGN files here: https://github.com/Zeta36/chess-alpha-zero, and even though he got good convergence of the MSE and policy, the model could only play more or less "good" games; he was not able to remove all the blunders, nor to get a model able to play any long-range strategic game.

What do you think?

@gcp (Contributor) commented Jan 15, 2018

but once the model converges quickly (and deeply) into ignoring the input and always saying 0, the weights of the last FC layer will be so small that any attempt to backpropagate a gradient would be almost negligible, wouldn't it

But why would it converge that way? Not all games are drawn. The mispredictions on the won or lost games will still cause big gradients. The predictions of 0 on the drawn games will cause no gradient. Predicting 0.1 on a game that was drawn will produce a tiny gradient compared to mispredicting a win as a draw. There is still plenty of room to make the distinctions, as there is strong pressure (and an actual gradient direction) on the network to correctly predict the 40% of games that are not drawn.
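The gradient sizes are easy to check: for a per-position loss (v - z)^2, the gradient with respect to v is 2(v - z). A quick illustration:

def mse_grad(v, z):
    return 2.0 * (v - z)

print(mse_grad(0.0, 1.0))   # a won game predicted as a draw -> -2.0 (large push away from 0)
print(mse_grad(0.1, 0.0))   # a draw predicted as 0.1        ->  0.2 (tiny push back toward 0)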

"Go" does not have this (theoretical) problem since it has no zero result game (no draws).

Predicting 0 is still an easy way to get a quick drop in MSE compared to always predicting some other constant value, or predicting randomly. So I'm not sure this argument even works!

About the number of games: @Akababa tried with a huge dataset of really big PGN files in here

This page talks about "1000" games or "3000" games. You want 2 orders of magnitude more, or you will get MSE overfitting, as I already pointed out.

@Zeta36 (Author) commented Jan 15, 2018

@Akababa tried with a lot more than 3000 games (even though the readme only talks about 1000 games).

About the quick convergence:

Imagine you have a worker playing self-play games (or a parser of PGN files). You get a chunk of 15,000 positions, for example. On average you will have more or less 5,000 positions with z=-1, 5,000 with z=0, and 5,000 with z=1.

Then you run the optimization worker and it reads the 15,000 positions to backprop. The loss function is MSE, so there is a clear and easy (fast) way to reach a local minimum where the model ignores the input (the board) and always says 0 (the mean) as the result (simply by taking the weights of the last FC layer to values very near 0).

In this case, the optimization will quickly reach a high accuracy of 1/3 (33%) and a low MSE loss of 0.66. After very few steps the model would get stuck with this function, and no further improvement could take it out of this fast (and deep) convergence (because the weights of the last FC layer will be so small that any attempt to backpropagate a gradient would be almost negligible).

I'm just talking theoretically, assuming the dataset has no bias but an almost perfect proportion of wins, losses and draws (-1, 0, 1), and that we use a really big dataset that can't be overfitted.
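One way to probe this hypothesis empirically: start a tiny tanh value "head" exactly at the always-0 solution, feed it perfectly balanced outcomes plus one input feature that actually predicts the result, and see whether SGD stays there. A minimal numpy sketch with made-up data, not project code:

import numpy as np

rng = np.random.default_rng(0)
n = 15000
z = rng.choice([-1.0, 0.0, 1.0], size=n)         # perfectly balanced outcomes
x = np.stack([z + 0.3 * rng.normal(size=n),      # informative feature (a material-count stand-in)
              rng.normal(size=n)], axis=1)       # a pure-noise feature

w = np.zeros(2); b = 0.0; lr = 0.1               # start exactly at the "always 0" solution
for step in range(5000):
    idx = rng.integers(0, n, size=256)
    v = np.tanh(x[idx] @ w + b)
    grad_v = 2 * (v - z[idx]) * (1 - v ** 2)     # d(mse)/dv propagated through the tanh
    w -= lr * (x[idx].T @ grad_v) / len(idx)
    b -= lr * grad_v.mean()

print(np.mean((np.tanh(x @ w + b) - z) ** 2))    # compare with the ~0.667 of a constant-0 predictor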

@gcp (Contributor) commented Jan 15, 2018

because the weights of the last FC layer will be so small that any attempt to backpropagate a gradient would be almost negligible.

I just don't see why this would happen, for the reasons already stated. And unless there were bugs, doesn't the practical result from @glinscott show that it does not?

With 40% of the games producing a strong gradient towards anything that remotely correlates with the population count of many input planes[1], and all draw games producing an ever tinier gradient towards "always 0", how could you get stuck deeply in a local minimum? It just sounds so weird.

[1] This is why my dataset has no resignations, FWIW.

@Zeta36 (Author) commented Jan 15, 2018

@glinscott's results are based on a heavily biased dataset (with twice as many z=1 as z=-1 games) and on a small, easy-to-overfit number of positions.

It would be great to check this with a much bigger number of positions and an unbiased dataset (with nearly equal numbers of z=-1, 0 and 1).

@Zeta36 (Author) commented Jan 24, 2018

@gcp, I don't know why you are so "aggressive" against the possibility that we have an issue with the value head. I think the facts are there and all point in that direction: look for example at @Error323's MSE plots (scaled up by 4), and look at how the games above seem to follow the policy head, but as soon as the policy fails (after the first opening moves, which are easy for the policy to memorize) the game turns chaotic. With even a minimally "good" value-head convergence, the game should follow a much smoother strategy all the way through.

@gcp (Contributor) commented Jan 24, 2018

I'm objecting to the idea that if you have a large dataset with:

70% class A
20% class B
10% class C

then it's impossible to build a classifier because it will always return class A.

I don't disagree that it's hard to train the value head, but the constant implication that it is not possible (which flies in the face of established evidence, unless you believe DeepMind made things up) is something I just do not understand. And there is data supporting that: if the network had fatally converged to 0, it would not be returning 68% in some positions.

@kiudee (Contributor) commented Jan 24, 2018

@Error323 What is the current preferred training method after your changes?

This is what I would do now, judging from the code (before I used parse.py on the chunks directly):

  1. Create chunks using lczero --supervise data.pgn
  2. Convert chunks into binary format using leela_to_proto.py
  3. Train using supervised_parse.py

@Zeta36 (Author) commented Jan 24, 2018

I'm not going to fight you, @gcp. But I think we should at least study the possibility that the value head is not learning because of some problem (or incapacity).

And I don't say it's impossible to train the value head, but that maybe it's not possible in the way we have been trying so far. And the facts are out there (the MSE loss and game quality are there even if you don't want to see them).

@Error323 (Collaborator)

From @kiudee's results, the discussion here and my own experience thus far, I think we need to train for more steps and change the learning-rate decay function. Ideally we would alter the decay function so that the learning rate decreases only once both the policy and value head have converged, as @gcp suggested in another thread. From @kiudee's graphs and mine you can clearly see that the MSE and/or policy has not reached convergence during some of the LR steps, so we decrease the learning rate too soon, which in turn results in slower learning --> significantly more steps.

Regarding the training method: indeed, steps 1, 2 and 3 are correct, @kiudee. This speeds up the training process for me by a factor of 2. The conversion of the planes to binary proto is very CPU-intensive, much more so than in the Go version, as we have more planes to consume per position.
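A sketch of the kind of plateau-gated schedule being suggested, i.e. only decay the LR once neither head is still improving (hypothetical helper, not the repo's trainer):

class PlateauLR:
    def __init__(self, lr=0.05, factor=0.1, patience=10, min_delta=1e-3):
        self.lr, self.factor, self.patience, self.min_delta = lr, factor, patience, min_delta
        self.best_policy = float("inf")
        self.best_mse = float("inf")
        self.stale = 0

    def step(self, policy_loss, mse):
        improved = False
        if policy_loss < self.best_policy - self.min_delta:
            self.best_policy, improved = policy_loss, True
        if mse < self.best_mse - self.min_delta:
            self.best_mse, improved = mse, True
        self.stale = 0 if improved else self.stale + 1
        if self.stale >= self.patience:   # neither head improved for `patience` evaluations
            self.lr *= self.factor
            self.stale = 0
        return self.lr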

@gcp (Contributor) commented Jan 24, 2018

And I don't say it's impossible to train the value head, but that maybe it's not possible in the way we have been trying so far. And the facts are out there (the MSE loss and game quality are there even if you don't want to see them).

If you read my post you'll see I don't disagree that the results aren't good; I'm objecting to the implication that it's impossible.

Another idea: train with a larger regularizer (large enough to control MSE over-fitting), but drop the weighting of the policy network. See if you get better MSE performance than we have now. After that has happened, make the weights equal again and go back to a normal regularizer.

I will fiddle with some of these.

@Zeta36 (Author) commented Jan 24, 2018

If you read my post you'll see I don't disagree that the results aren't good; I'm objecting to the implication that it's impossible.

I never said it's impossible, just that maybe it's impossible in the way we are currently trying. So I guess we more or less agree.

Another idea: train with a larger regularizer (large enough to control MSE over-fitting), but drop the weighting of the policy network. See if you get better MSE performance than we have now. After that has happened, make the weights equal again and go back to a normal regularizer.

Good idea. And there are a lot more things we can try before going as far as generating more data. By the way, this is the kind of "trick" I think DeepMind could be "hiding" in their last (and short) paper about AZ.

@gcp (Contributor) commented Jan 24, 2018

DeepMind didn't talk about doing SL in the last paper at all. We're doing it because it's interesting and useful for debugging.

@jkiliani (Contributor)

Good idea. And there are a lot more things we can try before going as far as generating more data. By the way, this is the kind of "trick" I think DeepMind could be "hiding" in their last (and short) paper about AZ.

DeepMind has always been in the habit of publishing what worked for them, not what didn't (and they're not exactly the only ones in academia doing that). So there's no alternative for our projects here but to try out multiple things and find out what works and what doesn't, even if that results in occasional setbacks.

@Error323 (Collaborator) commented Jan 24, 2018

DeepMind didn't talk about doing SL in the last paper at all. We're doing it because it's interesting and useful for debugging.

Indeed, it's very likely they didn't need to, because the whole purpose of that paper was to show the generic power of the algorithm; they only had to adapt the movegen and the NN I/O. I think that before we start self-play we should get a really solid understanding of the architecture through experimentation and good results in SL mode, i.e. similar to the SL results for Go.

@Zeta36 (Author) commented Jan 24, 2018

DeepMind didn't talk about doing SL in the last paper at all. We're doing it because it's interesting and useful for debugging.

Of course I know that, but SL has to be able to train the model if a self-play pipeline is able to. So if we are not able to train the model correctly using SL (and huge amounts of data), then self-play will not do much better.

Assuming that just more (self-play) data will fix this issue with the value head is very naive.

@gcp (Contributor) commented Jan 24, 2018

I'm not so sure about that. Even though Leela Zero obviously trains well with self-play data, I didn't manage to get a good SL-trained network either without messing with the MSE weighting. And the same was true in the DeepMind paper.

The RL data is more diverse and more specifically exposes the holes in the current network's knowledge, in a way SL data does not.

Consider this: what do you think the draw rate is for the SL data versus the RL data?

@Zeta36 (Author) commented Jan 24, 2018

I'm not sure about your position, @gcp. I doubt a model that completely fails in a supervised setting will do much better when trained on self-play data.

Anyway, if you prefer working in the brute-force manner, playing some millions of (self-play) games and seeing what happens instead of studying the issue theoretically, well, it's an (expensive) option.

We'll see the results in some months ;).

@jkiliani (Contributor)

@Zeta36 The whole DeepMind approach of AlphaGo Zero and AlphaZero is about using self-play reinforcement learning, presumably because they got much better results from doing so than from SL alone. Sure it costs resources, but it is really only self-play with MCTS and exploration from Dirichlet noise and temperature that will fix any knowledge holes in the training data set. At some point leela-chess will also have to start a self-play pipeline to figure out what really works in the chess context. Studying the issue theoretically is a good starting point, but it can only get you so far.
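For reference, the root exploration noise mentioned here is the mix the AlphaZero paper describes, P'(a) = (1 - eps) * P(a) + eps * eta with eta ~ Dir(alpha); the eps = 0.25 and alpha = 0.3 values below are the paper's chess settings as I recall them, so treat this as an illustrative sketch:

import numpy as np

def add_root_dirichlet_noise(priors, epsilon=0.25, alpha=0.3, rng=None):
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(priors))   # one noise sample per legal root move
    return (1 - epsilon) * np.asarray(priors) + epsilon * noise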

@Zeta36 (Author) commented Jan 24, 2018

Of course, self-play training is the final goal. I'm just saying that trying self-play training with a model that completely fails in the SL setting is not a good idea.

@jkiliani (Contributor)

I think "fully fails" is an unfair characterisation for the RL results by @glinscott, @Error323, @kiudee and also @Akababa and yourself in the chess-alpha-zero context. There are already multiple networks that play at a reasonable human level even if not close to the level of Stockfish. I doubt that would happen with the network size in use here anyway, but for the moment the point is to prove the setup works and will results in a network that can significantly outperform a SL net of the same dimensions. Comparing it to Stockfish or other top engines is just not realistic yet.

@Zeta36 (Author) commented Jan 24, 2018

Fully fails in the sense that it is unable to learn any chess concept or strategy and only seems to follow the well-converged policy head. In that sense, all the MSE plots shown so far have a very strange and suspicious convergence curve (always falling quickly to around ~0.6 once scaled up by 4).

My personal opinion is just that we should wait until the model can reach a realistic value-head convergence in SL mode, showing us that the network has learned at least some minimal value strategy.

But as I say, it's just my opinion; if you want to start creating millions of self-play games already, that's fine. We'll see in some months whether we get something or not.

@gcp (Contributor) commented Jan 24, 2018

I agree that without a value head that has at least some sensible output you can't even really be sure the thing is properly debugged.

For all you know your tree search has black and white scores reversed or something, you'd never notice.

I'm just saying that RL data has some better properties that make it easier to work with.

@Akababa (Contributor) commented Jan 24, 2018

I really don't understand all the fuss about the value head. Why can't we just remove the color plane from the input?

@kiudee (Contributor) commented Jan 24, 2018

@gcp I was processing your latest stockfish games and got an invalid game:

Processed 223910 games
Invalid game in sftrain_clean.pgn/data

Scid also only reads 223k games.

@gcp (Contributor) commented Jan 24, 2018

They're from pgn-extract, so that would be strange. I uploaded a new one this morning; it loads fine for me in Scid (vs PC).

@gcp (Contributor) commented Jan 24, 2018

I've been looking at the training, and it looks like the policy and value heads are from the AlphaGo Zero paper, not the Alpha Zero paper? That would affect things! I'll file a separate issue.

@kiudee (Contributor) commented Jan 24, 2018

Just to make sure that there were no network errors, I get the following checksum:

2386fc9cb17e88a0c0c06fce0bd50288  sftrain_clean.pgn/data

edit: Found the problem. The download using wget works and I get 352k games:

f1825d12c58d7ac04b93d80c79a62ea2  data

The problem was Firefox - it was (repeatedly) downloading an incomplete file. Any idea why?

@gcp (Contributor) commented Jan 24, 2018

Any idea why?

Initial download failed and the partially downloaded file was not deleted? (And then, I think, it tried to resume after I had updated the file.) If you check your downloads dir you might find a .part file.

Downloading with Firefox works fine here FWIW.

@Error323 (Collaborator)

@gcp what do you mean? I don't recall the AlphaZero paper talking about the NN architecture.

@gcp (Contributor) commented Jan 24, 2018

@Error323 They describe the policy output head. It needs to be quite different from Go's, because Go moves only have a destination square, which maps nicely onto a single board; not so for chess. See issue #47 for more details.

It looks like here the policy output from Leela Zero was copied without making the required change.
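For reference, the chess policy encoding the AZ preprint describes stacks 73 move-type planes over the 8x8 from-squares, versus Go's single to-square board; the breakdown below is from memory of that paper, so double-check it against issue #47:

queen_moves = 7 * 8          # 7 distances x 8 directions
knight_moves = 8
underpromotions = 3 * 3      # knight/bishop/rook x three pawn-move directions
move_planes = queen_moves + knight_moves + underpromotions   # 73
print(8 * 8 * move_planes)   # 4672 policy outputs for chess
print(19 * 19 + 1)           # 362 (board points + pass) for Go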

@glinscott (Owner)

@Error323 has proven that it is indeed possible :). Congrats!
