# Policy and value heads are from AlphaGo Zero, not Alpha Zero #47

Closed
opened this Issue Jan 24, 2018 · 71 comments

Contributor

### gcp commented Jan 24, 2018 • edited

Line 366 in 09eb87f

The structure of these heads matches Leela Zero and the AlphaGo Zero paper, not the Alpha Zero paper.

The policy head convolves the last residual output (say 64 x 8 x 8) with a 1 x 1 into a 2 x 8 x 8 outputs, and then converts that with an FC layer into 1924 discrete outputs.

Given that 2 x 8 x 8 only has 128 possible elements that can fire, this seems like a catastrophic loss of information. I think it can actually only represent one from and one to square, so only the best move will be correct (and accuracy will look good, but not loss, and it can't reasonably represent MC probabilities over many moves).

In the AZ paper they say: "We represent the policy π(a|s) by a 8 × 8 × 73 stack of planes encoding a probability distribution over 4,672 possible moves." Which is quite different.

They also say: "We also tried using a flat distribution over moves for chess and shogi; the final result was almost identical although training was slightly slower."

But note that for the above-mentioned reason it is almost certainly very suboptimal to construct the flat output from only 2 x 8 x 8 inputs. This works fine for Go because moves only have a to-square, but chess also has from-squares. 64 x 8 x 8 may be reasonable, if we forget about underpromotion (we probably can).

The value head has a similar problem: it convolves to a single 8 x 8 output, and then uses an FC layer to transform 64 outputs into...256 outputs. This does not really work either.

The value head isn't precisely described in the AZ paper, and a single 1 x 8 x 8 is probably good enough, but the 256 in the FC layer makes no sense then. The problems the value layer has right now might have a lot to do with the fact that the input to the policy head is broken, so the residual stack must try to compensate for this.
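To make the bottleneck concrete, the arithmetic can be sketched in a few lines of Python. The helper name `head_bottleneck` is made up for illustration; the sizes are the ones quoted in this comment (8 x 8 board, 1924 flat policy outputs, and the candidate filter counts 2, 64 and 73).

```python
# Sizes quoted in the comment above; the function name is hypothetical.
BOARD = 8 * 8          # squares on a chess board
POLICY_OUTPUTS = 1924  # flat move encoding produced by the FC layer

def head_bottleneck(conv_filters, fc_outputs, board=BOARD):
    """Return (fc_inputs, fc_outputs, outputs_per_input) for a 1x1-conv + FC head."""
    fc_inputs = conv_filters * board
    return fc_inputs, fc_outputs, fc_outputs / fc_inputs

for filters in (2, 64, 73):
    fc_in, fc_out, ratio = head_bottleneck(filters, POLICY_OUTPUTS)
    print(f"{filters:>2} x 8 x 8 = {fc_in:>4} activations -> {fc_out} logits "
          f"({ratio:.2f} outputs per input)")
```

With 2 filters the FC layer has to expand 128 activations into 1924 logits, while 64 or 73 filters give it more inputs than outputs.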

Contributor

### gcp commented Jan 24, 2018 • edited

 Thinking about it, the value head coming from a single 1 x 8 x 8 means it can only represent 64 evaluations. This would already be little for Go, where ahead or behind can be represented in stones. But for chess, where we often talk about centipawns, it's even worse.


Collaborator

### Error323 commented Jan 24, 2018 • edited

 OMG How did I miss this 😱. Kind of amazing it does what it does right now...
Contributor

### kiudee commented Jan 24, 2018

 I am also wondering now which kind of value head they used in AlphaZero, since it is not written in the paper. I will try a few architectures.
Collaborator

### Error323 commented Jan 24, 2018 • edited

~~@gcp @kiudee I'm now gonna try the following: Going from 64 to 32 in both heads.~~
Collaborator

### Error323 commented Jan 24, 2018

@gcp According to the AlphaGo Zero paper, the network also applies a batchnorm and ReLU in both heads. Why did you skip this?

> The output of the residual tower is passed into two separate 'heads' for computing the policy and value. The policy head applies the following modules:
>
> 1. A convolution of 2 filters of kernel size 1 × 1 with stride 1
> 2. Batch normalization
> 3. A rectifier nonlinearity
> 4. A fully connected linear layer that outputs a vector of size 19² + 1 = 362, corresponding to logit probabilities for all intersections and the pass move
>
> The value head applies the following modules:
>
> 1. A convolution of 1 filter of kernel size 1 × 1 with stride 1
> 2. Batch normalization
> 3. A rectifier nonlinearity
> 4. A fully connected linear layer to a hidden layer of size 256
> 5. A rectifier nonlinearity
> 6. A fully connected linear layer to a scalar
> 7. A tanh nonlinearity outputting a scalar in the range [−1, 1]
Contributor

### kiudee commented Jan 24, 2018

We definitely need some kind of non linearity in the fully connected layer. Otherwise we will only ever be able to learn a linear function for each head. I also see no reason not to apply batch normalization, but that is not as crucial.
Collaborator

### Error323 commented Jan 24, 2018

 No never mind, I didn't read correctly. He does apply both BN and the ReLu, it's encoded in the `conv_block` function. Sorry.
Collaborator

### Error323 commented Jan 24, 2018

Ok, given the above I'm trying out two different networks:

### 1. NN 64x5

Policy head:

1. A convolution of 32 filters of kernel size 1 × 1 with stride 1
2. Batch normalization
3. A rectifier nonlinearity
4. A fully connected linear layer that outputs a vector of size 1924

Value head:

1. A convolution of 32 filters of kernel size 1 × 1 with stride 1
2. Batch normalization
3. A rectifier nonlinearity
4. A fully connected linear layer to a hidden layer of size 256
5. A rectifier nonlinearity
6. A fully connected linear layer to a scalar
7. A tanh nonlinearity outputting a scalar in the range [−1, 1]

### 2. NN 128x5

Policy head:

1. A convolution of 73 filters of kernel size 1 × 1 with stride 1
2. Batch normalization
3. A rectifier nonlinearity
4. A fully connected linear layer that outputs a vector of size 1924

Value head:

1. A convolution of 32 filters of kernel size 1 × 1 with stride 1
2. Batch normalization
3. A rectifier nonlinearity
4. A fully connected linear layer to a hidden layer of size 256
5. A rectifier nonlinearity
6. A fully connected linear layer to a scalar
7. A tanh nonlinearity outputting a scalar in the range [−1, 1]
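For readers who want to see the seven value-head steps end to end, here is a minimal pure-Python sketch. It is not the project's TensorFlow code: all function and parameter names are invented for illustration, the weights are random, and the batch normalization uses the statistics of the single input as a stand-in for stored running averages.

```python
import math
import random

def conv1x1(x, w, b):
    """1x1 convolution. x: [c_in][squares], w: [c_out][c_in], b: [c_out]."""
    squares = len(x[0])
    return [[b[o] + sum(w[o][i] * x[i][s] for i in range(len(x)))
             for s in range(squares)] for o in range(len(w))]

def batchnorm(x, eps=1e-5):
    """Per-channel normalization using the channel's own mean and variance
    (a stand-in for the running statistics a trained network would store)."""
    out = []
    for ch in x:
        mean = sum(ch) / len(ch)
        var = sum((v - mean) ** 2 for v in ch) / len(ch)
        out.append([(v - mean) / math.sqrt(var + eps) for v in ch])
    return out

def relu(values):
    return [max(0.0, v) for v in values]

def dense(v, w, b):
    """Fully connected layer. w: [n_out][n_in], b: [n_out]."""
    return [b[o] + sum(wi * vi for wi, vi in zip(w[o], v)) for o in range(len(w))]

def value_head(x, p):
    h = conv1x1(x, p["conv_w"], p["conv_b"])                # step 1
    h = [relu(ch) for ch in batchnorm(h)]                   # steps 2-3
    flat = [v for ch in h for v in ch]                      # filters * 64 inputs
    hidden = relu(dense(flat, p["fc1_w"], p["fc1_b"]))      # steps 4-5
    (out,) = dense(hidden, p["fc2_w"], p["fc2_b"])          # step 6
    return math.tanh(out)                                   # step 7

def random_params(c_in, filters, hidden, rng, squares=64):
    """Random placeholder weights with the shapes the head expects."""
    def mat(n, m):
        return [[rng.gauss(0, 0.1) for _ in range(m)] for _ in range(n)]
    return {"conv_w": mat(filters, c_in), "conv_b": [0.0] * filters,
            "fc1_w": mat(hidden, filters * squares), "fc1_b": [0.0] * hidden,
            "fc2_w": mat(1, hidden), "fc2_b": [0.0]}

rng = random.Random(0)
tower_out = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(64)]  # 64 x 8 x 8
v = value_head(tower_out, random_params(c_in=64, filters=32, hidden=256, rng=rng))
print(v)  # some value in [-1, 1]
```

The policy head follows the same pattern, stopping after a single `dense` call that produces the 1924 logits.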
Owner

### glinscott commented Jan 24, 2018

 Wow, great call @gcp. Glad you caught this before we kicked off the distributed learning process.
Collaborator

### Error323 commented Jan 25, 2018

Using the above networks I'm reaching accuracies of 45% and 52%, respectively. There is a strange periodicity in the graphs that I'm not sure about. I'm using a shuffle buffer of 2^18 and 16 prefetches. Thoughts?

### Zeta36 commented Jan 25, 2018

I don't want to bother you again with the same thing, but your MSE plot is again near 0.15, which, scaled by 4, gives ~0.6 (you should stop scaling the MSE loss, by the way, at least while you are studying the real convergence of the value head). A loss of around 0.6 (given that we are working with three integer outcomes, z = -1, 0, 1) means the NN has learned nothing but a statistical mean response. I have repeated this many times in the other post:

In fact I did a toy experiment to confirm this. As I mentioned, the NN was unable to improve after reaching 33% accuracy (~0.65 mean squared loss). This makes sense if the NN always returns values very near the mean. Imagine we feed in a dataset of 150 games: ~50 are -1, ~50 are 0 and ~50 are 1. If the NN learns, for example, to always say something near 0, we get an MSE loss of 100/150 ≈ 0.66.
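The baseline in this argument is easy to check numerically. The sketch below assumes the even three-way split of outcomes from the toy example and the division by 4 described above; the function name is illustrative only.

```python
def constant_mse(targets, prediction):
    """Mean squared error of a model that always outputs `prediction`."""
    return sum((z - prediction) ** 2 for z in targets) / len(targets)

# Toy dataset from the comment: 50 losses, 50 draws, 50 wins.
targets = [-1] * 50 + [0] * 50 + [1] * 50

baseline = constant_mse(targets, 0.0)   # always predict the mean outcome
print(round(baseline, 3))               # 0.667
print(round(baseline / 4, 3))           # 0.167, the same baseline on the scaled plot
```

A value head whose scaled MSE sits near 0.15 to 0.17 is therefore doing no better than a constant guess on such data.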
Contributor

### kiudee commented Jan 25, 2018

 I also started training a network (on the stockfish data) which is at mse=0.09 (on the test set) after 10k steps. I will see if I observe the same periodicity.
Collaborator

### Error323 commented Jan 25, 2018

 @Zeta36 I know, I didn't want to alter the result outputs without having a merge into master. Otherwise comparing results is just more confusing.
Collaborator

### Error323 commented Jan 25, 2018

 @kiudee Did you also use a value head weight of 0.01?
Contributor

### kiudee commented Jan 25, 2018

 @Error323 Yes, but ignore my results for now. I think there was a problem with the chunks I generated. I will report back as soon as the chunks are generated (probably tomorrow).
Contributor

### kiudee commented Jan 26, 2018 • edited

@Error323 Somewhere, there must be a bug. I re-chunked the Stockfish data, converted it to train/test and let it learn for a night and got the following (too good to be true and likely massively overfit) result:

And yes, the network basically plays "random". I am using the following config:

```yaml
name: 'kb1-64x5'                  # ideally no spaces
gpu: 0                            # gpu id to process on

dataset:
  num_samples: 352000             # nof samples to obtain
  train_ratio: 0.75               # trainingset ratio
  skip: 16                        # skip every n +/- 1 pos
  input: './data/'                # supports glob
  path: './data_out/'             # output dir

training:
  batch_size: 512
  learning_rate: 0.1
  decay_rate: 0.1
  decay_step: 100000
  policy_loss_weight: 1.0
  value_loss_weight: 0.01
  path: '/tmp/testnet'

model:
  filters: 64
  residual_blocks: 5
```
Collaborator

### Error323 commented Jan 26, 2018

Hmmm how did you generate the chunks and how many games are there? Given n chunks and skip size s you should generate (n*15000)/s samples. This makes sure you're sampling across the entire set of chunks.

Over here things start to look promising @Zeta36

The new network (red) was forked from the orange and the loss weight on the value head set to 1. Learning rate is set to 0.001 until the 600K'th step, after which it will decay to 0.0001. We are now dropping well below the statistical mean response on the MSE.
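As a quick worked example of the sample-count formula, the following sketch plugs in the 3354 chunks and skip of 16 mentioned elsewhere in this thread. It assumes the 15000 in the formula is the number of positions per chunk; the constant and function names are made up for illustration.

```python
POSITIONS_PER_CHUNK = 15000  # assumed meaning of the 15000 in (n*15000)/s

def num_samples(chunks, skip, per_chunk=POSITIONS_PER_CHUNK):
    """Number of samples needed to cover every chunk once at the given skip."""
    return (chunks * per_chunk) // skip

print(num_samples(3354, 16))  # 3144375
```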
Contributor

### jkiliani commented Jan 26, 2018

@gcp is just implementing a validation split for the TensorFlow training code in Leela Zero (gcp/leela-zero#747), and it fixed a lot of problems for the reinforcement learning pipeline there. You might check whether any of this is relevant for you as well...
Contributor

### gcp commented Jan 26, 2018 • edited

 They already have this fix in their training code (the skip in the configuration), I think. As well as the validation splits. The training data generation was changed quite a bit to deal with the 10x increase in inputs and even faster network evaluation (8x8 vs 19x19) you get in chess.
Contributor

### kiudee commented Jan 26, 2018

 @Error323 I have 352000 games which result in 3354 chunks. So, from your equation I should set the number of samples to 3144375?

### Zeta36 commented Jan 26, 2018

Yes, it looks promising, @Error323. Can you check the playing strength of that network, looking for signs that it has learned some chess strategy? If you could compare the (strategic) playing strength of the orange NN against the new red one, it'd be great. If the red network really learned something beyond the mean, I think we should already see it (it will probably be easier to see this improvement after the opening phase is over, since the policy head is really strong and decisive in the first dozen moves).
Contributor

### kiudee commented Jan 26, 2018

 I found the bug that was causing the random play: ` innerproduct<32*width*height, NUM_VALUE_CHANNELS>(value_data, ip1_val_w, ip1_val_b, winrate_data);` I forgot to adjust the sizes of the inner products which failed silently. I think we should replace these magic constants by global constants.
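One way to keep this kind of mismatch from failing silently is to derive every layer size from named constants and check the loaded weights against them. The sketch below is in Python rather than the engine's C++, and apart from `NUM_VALUE_CHANNELS` and `ip1_val_w` (which appear in the line above) every name and value is an assumption made for illustration.

```python
WIDTH, HEIGHT = 8, 8
VALUE_HEAD_FILTERS = 32   # hypothetical constant replacing the literal 32
NUM_VALUE_CHANNELS = 256  # assumed to be the hidden size of the value head

def check_fc_shape(name, weights, n_in, n_out):
    """Raise if a flat weight list does not hold exactly n_in * n_out values."""
    expected = n_in * n_out
    if len(weights) != expected:
        raise ValueError(
            f"{name}: expected {n_in} x {n_out} = {expected} weights, "
            f"got {len(weights)}")

fc_inputs = VALUE_HEAD_FILTERS * WIDTH * HEIGHT
ip1_val_w = [0.0] * (fc_inputs * NUM_VALUE_CHANNELS)   # placeholder weights
check_fc_shape("ip1_val_w", ip1_val_w, fc_inputs, NUM_VALUE_CHANNELS)  # passes

try:
    # Weights exported for a 1-filter head no longer match the 32-filter code.
    check_fc_shape("ip1_val_w", [0.0] * (WIDTH * HEIGHT * NUM_VALUE_CHANNELS),
                   fc_inputs, NUM_VALUE_CHANNELS)
except ValueError as err:
    print(err)
```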
Collaborator

### Error323 commented Jan 26, 2018 • edited

 I'm struggling with the same hah, crap. I fully agree on the magic constants to global. Any more places you're seeing problems? I'm using 73 output planes in both heads. I don't have segfaults anymore but am wondering whether I fixed everything correctly. https://gist.github.com/Error323/46a05ab5548eaeac95916ea428dd9dec @gcp @glinscott did I miss anything?
Owner

### glinscott commented Jan 26, 2018

 @Error323 one way to validate is to run it under asan. I do that with the debug build, and `-fsanitize=address` passed to both compiler and linker. Your changes look correct to me though.
Contributor

### kiudee commented Jan 26, 2018

 @Error323 Looks good. After adjusting those places leela-chess started to play sane chess for me.
Collaborator

### Error323 commented Jan 26, 2018 • edited

Ok so I'm still running an evaluation tournament of "orange" vs "red" in the graph above. Orange having higher accuracy and higher MSE. Red having lower accuracy but also lower MSE. So far the score is in favor of Orange with `23 - 43 - 25`. Games can be found here: https://paste.ubuntu.com/26466921/

I know it's only 800 playouts, but I still think this is inferior play given the network and the dataset. Both still just give away material or fail to capture important material for the taking. So I don't think we're ready for self play yet. The networks need to be better.

Currently trying out a 64x5 network and 128x5 with 0.1 loss weight on the value head. Value head is 8x8x8 -> 64 -> 1 and policy head is 32x8x8 -> 1924. Let me know if you have different ideas/approaches.
Owner

### glinscott commented Jan 27, 2018

 @Error323 interesting! What happens if you take a normal midgame position, run the network on it, and then eg. remove the queen, and re-run. Does the win percentage change?
Collaborator

### Error323 commented Jan 27, 2018

@glinscott Using the midgame from @kiudee it does seem to be well aware of the queen.

```
position fen 8/4R2p/6pk/5p2/8/4P1KP/1r3PP1/8 w - - 0 48
go
eval=0.669482, 801 visits, 11247 nodes, 801 playouts, 468 n/s
bestmove e7f7
position fen Q7/4R2p/6pk/5p2/8/4P1KP/1r3PP1/8 w - - 0 48
go
eval=0.990093, 801 visits, 16811 nodes, 801 playouts, 479 n/s
bestmove a8a7
```

And the same initial FEN string with 16K playouts:

```
Playouts: 15654, Win: 75.49%, PV: e7c7 b2b1 g3f3 b1f1 g2g4 f5g4 h3g4 h6g5 f3g2 g5g4 g2f1 g4f3 c7h7 g6g5 h7h6 g5g4
eval=0.669482, 16001 visits, 239960 nodes, 16001 playouts, 443 n/s
bestmove e7c7
```
Collaborator

### Error323 commented Jan 27, 2018

I think at this point there are two valid approaches.

1. Make sure we obtain a well-functioning network through supervised learning. This is partially uncharted territory, as DeepMind has not released specifics on their network, nor attempted supervised learning with chess. From results thus far it seems to be a delicate balance between the hyperparameters (loss weights, learning rate, decay function). Also, how do we define good? When are we confident enough?
2. Go for the reinforcement learning self-play approach on a small network, i.e. 64x5. As we don't have a dedicated datacenter with a steady rate we should brew our own learning-rate-decay function that observes the gradients of the losses and alters the learning rate accordingly. We could look into further optimizations for faster game simulations, e.g. TensorRT for NVIDIA cards, this might produce a significant boost in performance with the INT8 operators, did you look into this yet @gcp?

Either way, with both approaches it might be good to build the learning-rate-decay function. @glinscott shall I give that a try?
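One possible shape for such a learning-rate-decay function is sketched below. It is only an illustration of the idea, not anything the project implemented: it watches the slope of the loss curve by comparing the average loss over two consecutive windows and drops the rate when the improvement falls below a threshold. The class name, window size and threshold are all invented; the default rate and decay factor simply mirror the `learning_rate: 0.1` and `decay_rate: 0.1` values from the config posted earlier.

```python
from collections import deque

class PlateauDecay:
    """Drop the learning rate when the averaged loss stops improving."""

    def __init__(self, lr=0.1, factor=0.1, window=100,
                 min_rel_improvement=0.01, min_lr=1e-5):
        self.lr = lr
        self.factor = factor
        self.window = window
        self.min_rel_improvement = min_rel_improvement
        self.min_lr = min_lr
        self.losses = deque(maxlen=2 * window)

    def update(self, loss):
        """Record one training loss and return the learning rate to use next."""
        self.losses.append(loss)
        if len(self.losses) < 2 * self.window:
            return self.lr                      # not enough history yet
        history = list(self.losses)
        older = sum(history[:self.window]) / self.window
        recent = sum(history[self.window:]) / self.window
        if older - recent < self.min_rel_improvement * abs(older):
            self.lr = max(self.min_lr, self.lr * self.factor)
            self.losses.clear()                 # fresh comparison at the new rate
        return self.lr

sched = PlateauDecay(lr=0.1, window=5)
for step in range(10):
    lr = sched.update(1.0)                      # a completely flat loss curve
print(lr)                                       # one decay step below 0.1
```

In a real training loop the value returned by `update` would be fed to the optimizer at each step; noisy losses would call for a larger window than the 5 used here.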
Collaborator

### Error323 commented Jan 29, 2018

> I'm getting closer and closer to having the client and server ready for the self-learning run, I have the DB schema mostly set up, and starting to write some tests for it. It's a golang + postgresql backend, pretty standard stuff, so shouldn't take too much longer.

Nice! I think we should also start to focus more on the C++ side again now. The nodes per second need to go up. I'm experimenting with caffe and TensorRT in #52 and I definitely think that the heads should go to the GPU as you suggested in #51. Before we launch the self-learning we should be as fast as possible to utilize all those precious cycles as best we can :)

I was wondering what kind of GPUs people here have? Also we need Windows testers.
Contributor

### gcp commented Jan 29, 2018 • edited

> We could look into further optimizations for faster game simulations, e.g. TensorRT for NVIDIA cards, this might produce a significant boost in performance with the INT8 operators, did you look into this yet @gcp?

The problem of TensorRT is that it requires optimization to the specific weights of the network. It's OK once you have a trained network, but not if you're still training it or have variable weights. (I guess the question is if you could pre-transform the networks on the server machine - that might work)

INT8 support is exclusive to the 1080 GTX (Ti or not I'm not sure?) only, AFAIK.

Anything that depends on cuDNN also requires the end user to make an account on NVIDIA's site and download themselves due to licensing restrictions. I don't know how TensorRT's license looks.

It's also NVIDIA-only. I try to stay as far away as possible from vendor lock in.

But some people have made a version of Leela Zero that does the network eval by a TCP server and have that calculated by Theano using cuDNN, for example. If you don't care for end-user setup complexities more things are possible.
Contributor

### gcp commented Jan 29, 2018 • edited

 @Error323 this is fantastic! I noticed that the speed went way down when I tested out a 64 channel network, although it should be less of a hit to move to 32. I'm thinking we probably have to move that part off to the GPU portion now, which shouldn't be too difficult. There are implementations of this in Leela Zero's pull requests. (They were not merged because for Go there was no gain) gcp/leela-zero#185
Collaborator

### Error323 commented Jan 29, 2018

> The problem of TensorRT is that it requires optimization to the specific weights of the network. It's OK once you have a trained network, but not if you're still training it or have variable weights. (I guess the question is if you could pre-transform the networks on the server machine - that might work)

Indeed, that was the thinking. Pretransform once per new network and deploy across all nv based workers.

> INT8 support is exclusive to the 1080 GTX (Ti or not I'm not sure?) only, AFAIK.

I believe it's also available on latest titans and their Volta architecture. Maybe 1080 too? Not sure about that though.

> Anything that depends on cuDNN also requires the end user to make an account on NVIDIA's site and download themselves due to licensing restrictions. I don't know how TensorRT's license looks.

This fact reaaaaaaaaaaaally sucks :(

> It's also NVIDIA-only. I try to stay as far away as possible from vendor lock in.

I understand indeed, this is good. I just want to squeeze every drop of performance out of my nv cards.

> But some people have made a version of Leela Zero that does the network eval by a TCP server and have that calculated by Theano using cuDNN, for example. If you don't care for end-user setup complexities more things are possible.

This sounds very inefficient?
Collaborator

### Error323 commented Jan 30, 2018

After training for 2 days and 20 hours, the network is done. Below you can see the tensorboard graphs.

In order to see if training for so long helped, I did some quick experiments with various networks against gnuchess tc=30/1:

| Steps | Playouts | Rounds | lczero vs gnuchess |
|-------|----------|--------|--------------------|
| 796K  | 40       | 10     | 0 - 9 - 1          |
| 296K  | 4000     | 10     | 3 - 6 - 1          |
| 796K  | 4000     | 10     | 8 - 1 - 1          |

kbb-net.zip contains the weights, config yaml file and results. With your permission @glinscott I'd like to suggest we hereby close #20.

Contributor

### jkiliani commented Jan 30, 2018

 What's the Elo rating for Gnuchess at the settings you used?
Contributor

### kiudee commented Jan 30, 2018

@Error323 Good job. I am going to try playing that network. I am also currently running training on the Stockfish data with a 10x128 network. It's using the following parameters:

- Value loss weight: 0.02
- Policy head: 64x8x8
- Value head: 32x8x8 -> 256

For now (at iteration 220k) it still plays quite badly, but through search it is typically able to improve its moves, which suggests that the learned value head is providing some benefit.
Owner

### glinscott commented Jan 30, 2018

 @Error323 congrats! That's awesome progress :). One thing I'm curious about - which version of GnuChess did you use?
Collaborator

### Error323 commented Jan 30, 2018

It's 6.2.2, the default in Ubuntu 16.04. I don't know the Elo rating. Note that I handicapped it severely with 30 moves per minute.
Collaborator

### Error323 commented Jan 31, 2018

 @glinscott that site is a goldmine for supervised learning 😍 The PGN files seem to be available.
Contributor

### kiudee commented Jan 31, 2018

 I am aborting my training run now. This is the state after 400k steps: Even though it has an accuracy of almost 80% on the stockfish data, it still has problems finding good moves with a low number of playouts. I suspect there might be a problem of variety in the Stockfish data, causing it to have similar positions in train/test. The learned value function is able to compare positions relatively (which is why it improves its move choice when letting it do more playouts), but the absolute score is always hovering around 55% for white. It looks like a local optimum where it learned to adjust the score slightly based on the position. If you want I can upload weights and config, but the file is too large for attaching it here.
Collaborator

### Error323 commented Jan 31, 2018 • edited

> Even though it has an accuracy of almost 80% on the stockfish data, it still has problems finding good moves with a low number of playouts. I suspect there might be a problem of variety in the Stockfish data, causing it to have similar positions in train/test.

I agree. I think that even though the final gamestates may be different, the first n ply share many similarities. Given your graphs it should be annihilating everything, but the data compared to the entire chess gametree is too sparse.
Contributor

### kiudee commented Jan 31, 2018

Just to get an idea, this is the PV of the start position including the points where it changed its mind:

```
Playouts: 1412, Win: 55.40%, PV: g1h3 h7h6 e2e4 e7e6 c2c3 g7g6 f1b5
Playouts: 2908, Win: 55.41%, PV: g1h3 h7h6 e2e4 b8c6 f1a6 g7g6 d1f3 g8f6
Playouts: 4405, Win: 55.35%, PV: g1h3 h7h6 e2e4 e7e5 c2c3 c7c6 f1a6
Playouts: 5912, Win: 55.29%, PV: g1h3 h7h6 e2e4 e7e5 c2c3 f7f6 f1c4
Playouts: 11568, Win: 55.25%, PV: g1h3 h7h6 e2e4 d7d5 f1b5 b8c6 d1f3 g7g6 d2d3 g8f6
Playouts: 37993, Win: 55.64%, PV: e2e4 b8c6 g1h3 h7h6 f1a6 g7g6
Playouts: 110811, Win: 56.01%, PV: e2e4 d7d5 e4e5 d5d4 g1h3 d4d3 f1e2 h7h6 c2c3 c8d7 e2f3
Playouts: 498938, Win: 56.21%, PV: e2e4 e7e5 g1f3 f8c5 d2d4 c5d4 c1g5 h7h6 h2h4 d4e3 d1d4 e3c1 g2g3 c1e3 d4e3 g7g6 e3d3
eval=0.562360, 500001 visits, 15851401 nodes, 500001 playouts, 563 n/s
bestmove e2e4
```
Contributor

### kiudee commented Jan 31, 2018

I found something interesting. I gave the network trained by @Error323 the following position from the 10th game of AlphaZero vs Stockfish:

```
rnb2r2/p3bpkp/1ppq3N/6p1/Q7/2P3P1/P4PBP/R1B2RK1 w - - 0 19
```

After 325950 playouts the network prefers the move Re1:

```
...
Playouts: 324241, Win: 56.87%, PV: h6f7 f8f7 c1e3 e7f6 a1d1 d6e7 e3d4 b6b5 a4c2 f6d4 c3d4 c8g4 d4d5 e7f6 h2h3 g4f3 g2f3 f6f3 f1e1 c6d5 d1d3 f3f5 c2d2 d5d4 d3d4 b8c6 d4d5 f5f6 d2g5 f6g5 d5g5 g7f8 e1e4 a7a6 g5c5 a8c8 g1g2 f7f6 f2f4
Playouts: 325950, Win: 56.85%, PV: f1e1 c8e6 c1e3 g7h6 h2h4 b8d7 a1d1 d7c5 h4g5 h6g6 a4h4 d6d1 e1d1 h7h5 g2e4 g6g7 e4f3 f8h8 h4d4 g7g6 d4e5 h8h7
...
Playouts: 864489, Win: 58.53%, PV: f1e1 c8e6 c1e3 g7h6 h2h4 b8d7 a1d1 d7c5 h4g5 h6g7 a4h4 d6d1 e1d1 f8d8 e3d4 g7g8 h4h6 e7f8 h6f6 d8d4 d1d4 a8c8 d4d8 c5d7 f6e6 c8d8 e6c6 d7e5 c6c7 d8d1 g2f1 e5f3 g1g2 f3g5 c7a7 f8c5 f1e2 d1d2 a7a8 g8g7 g2f1 d2a2 f2f4
```
Collaborator

### Error323 commented Jan 31, 2018

 @kiudee it's not immediately obvious to me. Could you elaborate?
Contributor

### kiudee commented Jan 31, 2018 • edited

If you analyze the position with Stockfish, it thinks the move is losing/even (depending on depth). Yet, in the game it is soon apparent that white is winning. Your network also seems to prefer this move after some time. This does not mean that the network has the same deep understanding of the position that AlphaZero has, but it is intriguing.

Stockfish at depth 42:

```
42 [-0.54] 19.... Kxh6 20.h4 f6 21.Rxe7 Qxe7 22.Ba3 c5 23.Bxa8 Kg7 24.hxg5 fxg5 25.Bg2 Bf5 26.c4 Kh6 27.Rd1 Bg4 28.Rf1 Qd7 29.Qc2 Nc6 30.Bxc6 Qxc6 31.f4 gxf4 32.Bc1 Qe6 33.Rxf4 Rxf4 34.Bxf4+ Kg7 35.Kf2 Qf6 36.a3 Be6 37.Qe4 Qb2+ 38.Kg1 Qa1+ 39.Kh2 Qa2+ 40.Kg1 Qxc4 41.Qxc4 Bxc4 42.Bb8 a6 43.Ba7 b5 44.Bxc5 a5 45.Bd4+ Kg6 46.Kf2 (1176.08)
```

### lp-- commented Jan 31, 2018

Puzzling that it does not consider taking the knight at all:

```
Detecting residual layers...Loading kbb-net/kbb1-64x6-796000.txt v1...64 channels...6 blocks.
position fen r2qk2r/p2nppb1/2pp1np1/1p2P1Bp/3P2bP/2N2N2/PPPQ1PP1/2KR1B1R w kq - 1 10
go
Playouts: 304, Win: 58.96%, PV: f1e2 d6e5 d4e5 f6h7 g5f4 d8a5 c1b1 d7c5 f3d4
...
Playouts: 898, Win: 60.12%, PV: d1e1 d6e5 d4e5 f6h7 g5h6 g7h6 d2h6 g4f3 g2f3 d8a5 h6g7
...
Playouts: 3337, Win: 59.96%, PV: d4d5 d7e5 f3e5 d6e5 f2f3 g4d7 d5c6 d7c6 f1b5 c6b5 c3b5 d8d2 d1d2 e8g8 h1e1 f8c8
```

### lp-- commented Jan 31, 2018

Here it decides for no reason to give up the knight.

```
Detecting residual layers...Loading kbb-net/kbb1-64x6-796000.txt v1...64 channels...6 blocks.
position fen r2qk1nr/p2nppb1/2pp2p1/1p2P1Bp/3P2bP/2N2N2/PPPQ1PP1/2KR1B1R b kq - 0 9
go
Playouts: 304, Win: 39.18%, PV: d6e5 d4d5 g4f3 g2f3 b5b4 c3e4 d8b6 d5d6 g8f6
...
Playouts: 2157, Win: 37.97%, PV: g8f6 f1e2 b5b4 e5f6 b4c3 d2c3 d7f6 c3c6 g4d7 c6c4 a8c8 c4b3 e8g8
```

But Stockfish thinks that black is even better with d6e5:

`info depth 20 seldepth 32 multipv 1 score cp 52 nodes 2709905 nps 1259249 hashfull 872 tbhits 0 time 2152 pv d6e5 d4e5 d7e5 f3e5 d8d2 d1d2 g7e5 f1d3 g8f6 h1e1 f6d7 f2f4 e5f6 g2g3 a7a6 c3e4 f6g5 e4g5 d7f6 d3e4 f6e4 g5e4 e8f8 e4c5 a8a7 d2d8 f8g7 d8h8 g7h8 e1e3 h8g7`

### lp-- commented Jan 31, 2018

Positions are from the full game it played with 1600 playouts:

```
1. e4 g6 2. d4 Bg7 3. Nc3 d6 4. h4 h5 5. Bg5 c6 6. Qd2 b5 7. Nf3 Bg4 8. O-O-O Nd7 9. e5 Ngf6 10. d5 Nxe5 11. Nxe5 dxe5 12. dxc6 Qa5 13. f3 Be6 14. Nxb5 Qb6 15. Bc4 Qxc6 16. Bxe6 fxe6 17. Rhe1 Qxb5 18. Qd3 Qxd3 19. Rxd3 Kf7 20. Rxe5 Rhc8 21. Rde3 Rc6 22. c3 Rb8 23. Ra5 a6 24. Bxf6 Bxf6 25. Kc2 Bxh4 26. Re4 Bf6 27. Rf4 h4 28. Raa4 g5 29. Rfb4 Rd8 30. Re4 Rd5 31. Re2 a5 32. Rd2 Rcc5 33. Re2 Rc6 34. Rd2 Re5 35. Rdd4 Rd6 36. Re4 Red5 37. Re2 Be5 38. Rae4 Bf4 39. a4 e5 40. b4 Kf6 41. Kb3 Kf5 42. Kc4 g4 43. fxg4+ Kxg4 44. bxa5 Rxa5 45. Kb4 Ra8 46. Rxe5 Bxe5 47. Rxe5 Kg3 48. Rg5+ Kf4 49. Rg7 e5 50. a5 e4 51. Kc5 Rd2 52. Kb6 Rb2+ 53. Kc6 Rxa5 54. c4 e3 55. c5 Rc2 56. Kd6 Raxc5 57. Re7 Rg5 58. Rh7 e2 59. Rxh4+ Kg3 60. Re4 Kxg2 61. Re3 Kf2 62. Re8 Rd2+ 63. Kc7 Rc2+ 64. Kd6 Rd2+ 65. Kc7 Rd4 66. Rf8+ Ke3 67. Re8+ Kf2 68. Rf8+ Ke3 69. Re8+ Kd2 70. Rd8 Rxd8 71. Kxd8 e1=Q 72. Kd7 Rg7+ 73. Kd6 Qd1 74. Ke5 Qf3 75. Kd4 Qf4+
Score: 0
```

### amj commented Feb 6, 2018

 @Error323 Scanning through this, the periodicity in the earlier graphs is almost certainly an artifact of the shuffling behavior. We were rather constantly surprised by the shuffle behavior not quite doing what we expected. We ended up turning the shuffle size up to almost the total number of steps we take per generation (2M)... not quite the same as what you're doing w/ SL here but that periodicity really, really jumped out at me.

### amj commented Feb 6, 2018

 @Error323 re: how to see when to drop the learning rate. It was suggested to us that we monitor the gradient updates as a way to see when it's time to drop the rate, and adjust accordingly.
Collaborator

### Error323 commented Feb 7, 2018

> We were rather constantly surprised by the shuffle behavior not quite doing what we expected. We ended up turning the shuffle size up to almost the total number of steps we take per generation (2M)...

Well this is somewhat scary. When using a shuffle buffer of 2^18 my memory consumption already went through the roof.

> how to see when to drop the learning rate. It was suggested to us that we monitor the gradient updates as a way to see when it's time to drop the rate, and adjust accordingly.

Thanks! This is very different from what I had in mind, and probably better! I'm gonna examine code and comments!
Contributor

### gcp commented Feb 7, 2018 • edited

> Well this is somewhat scary. When using a shuffle buffer of 2^18 my memory consumption already went through the roof.

In Leela Zero this was solved by dropping 15/16th of the training data randomly. This trades off 16x the input processing CPU usage for a 16x saving of memory. You can make this bigger if you have more or faster cores in the training machine. Whether this is usable somewhat depends on the input pipeline.

But yes, shuffle buffers by themselves aren't good enough if the training data is sequential positions from games.
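The idea of dropping most positions before they reach the shuffle buffer can be sketched as follows. This is a generic illustration in plain Python, not Leela Zero's actual input pipeline; the function names and the toy stream of `(game, ply)` pairs are invented for the example.

```python
import random

def subsample(positions, keep_one_in=16, rng=random):
    """Keep each position with probability 1/keep_one_in, dropping the rest."""
    for pos in positions:
        if rng.randrange(keep_one_in) == 0:
            yield pos

def shuffle_buffer(stream, size, rng=random):
    """Streaming shuffle: hold `size` items and emit a random one per new item."""
    buf = []
    for item in stream:
        if len(buf) < size:
            buf.append(item)
            continue
        i = rng.randrange(size)
        yield buf[i]
        buf[i] = item
    rng.shuffle(buf)
    yield from buf

rng = random.Random(0)
# Sequential positions: 200 games of 80 plies each, in game order.
positions = ((game, ply) for game in range(200) for ply in range(80))
out = list(shuffle_buffer(subsample(positions, 16, rng), size=256, rng=rng))
print(len(out), "of 16000 positions kept")
```

Because only about one position in sixteen survives, a buffer of 256 entries now spans roughly 4096 consecutive source positions, which is what breaks up runs of neighbouring positions from the same game.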

### amj commented Feb 7, 2018

We dropped data as well, sampling only 5% of the positions (here). This is for RL though, not SL. I haven't done much with SL.


Owner

### glinscott commented Mar 4, 2018

 Fixed now :).


Contributor

### gcp commented Dec 6, 2018

> The value head has a similar problem: it convolves to a single 8 x 8 output, and then uses an FC layer to transform 64 outputs into...256 outputs. This does not really work either.
>
> The value head isn't precisely described in the AZ paper, and a single 1 x 8 x 8 is probably good enough, but the 256 in the FC layer make no sense then.

Turns out that is actually what they did. From empirical experiments, doing the 1x1 256->1 down-convolution before the FC layers works very well (and better than doing, say, 1x1 or 3x3 256->32 and using that as FC input).

That said, making the value head FC layer bigger(!) than the input to it still seems strange to me and looks like it was carried over from the other games more than anything else.
