
net2net #704

Merged
merged 9 commits into leela-zero:next on Jan 24, 2018
Conversation

Member

@Ttl Ttl commented Jan 21, 2018

Implementation of net2net (https://arxiv.org/abs/1511.05641) for leelaz network.

Adding blocks works, but widening still has some bugs and the output of the widened network isn't exactly equal to the old one's. It's still pretty close, and even now it should be a better initialization than a random one.

Just putting this here in case anyone else was working on this. I'll investigate the widening later.

Example usage:

python net2net.py 1 64 <network> adds 1 block and 64 filters to the network. Output is written to <network>_net2net.

barrybecker4 and others added 2 commits January 19, 2018 22:00
Widening has still some bugs and the output doesn't match.
Adding more blocks works.
return weights, next_weights

rand = range(channels)
rand.extend(np.random.randint(0, channels, new_channels))

In Python 3, ranges cannot be directly extended:

AttributeError: 'range' object has no attribute 'extend'

After changing rand = range(channels) to rand = list(range(channels)), it works fine for me.
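
For reference, a minimal standalone sketch of the fix (the sizes are example values, not from net2net.py):

import numpy as np

channels, new_channels = 64, 64          # example sizes only
# list() is needed in Python 3, where range() returns an immutable range object
rand = list(range(channels))
rand.extend(np.random.randint(0, channels, new_channels))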

@isty2e

isty2e commented Jan 21, 2018

I tested with the latest best net, adding a block and 128 channels, and it does not run into any verification problem. By "the output of the widened network isn't exactly equal to the old one", do you mean a deviation under 1e-6?

@Ttl
Member Author

Ttl commented Jan 21, 2018

Verification in the program works. However, if you try the widened net in leelaz, it doesn't give the same heatmap.

@isty2e

isty2e commented Jan 21, 2018

Do you have even the slightest clue what the problem is here? Having read the code, I still don't see why the heatmap and NN evals are different.

@Ttl
Member Author

Ttl commented Jan 21, 2018

As you can see, the included verification code passes. I have no idea why the heatmaps are different.

The verification is calculated without batch norms and residual connections, so maybe it's something related to those? Although I believe I'm processing the batch norm weights correctly, and the residual connection shouldn't affect the widening. Or maybe there's a bug in the text processing somewhere?

@isty2e

isty2e commented Jan 21, 2018

As far as I have tested, text processing and I/O seem okay; I tested this by adding 0 blocks and 0 channels and removing the if new_channels == 0 check in conv_bn_wider(). Also, after adding blocks and filters, the first values of the weights were appropriately divided by rep_factor, and I don't think there is anything wrong with the batch norm layer either. Could there be something wrong with the last layers?


for i in range(len(rand)):
    noise = 0
    if i > channels:
Member

@Hersmunch Hersmunch Jan 21, 2018

Just reading through to learn/try to understand. I might be wrong, but should this be >=?

Member Author

Good catch. It should be >=.
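
A minimal sketch of the corrected loop (noise_std and its magnitude are hypothetical placeholders, not values from net2net.py):

import numpy as np

channels = 64
rand = list(range(channels))
rand.extend(np.random.randint(0, channels, 64))   # example: widen by 64 channels

noise_std = 5e-3                                  # hypothetical magnitude
for i in range(len(rand)):
    noise = 0
    if i >= channels:  # only the duplicated channels get symmetry-breaking noise
        noise = np.random.normal(0, noise_std)
    # (in the real script the noise is then added to the duplicated filter's weights)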

@Hersmunch
Member

Hersmunch commented Jan 21, 2018

Does the order of deepening and widening matter? Has the order of deepen-then-widen been chosen for a particular reason?
From what I read I don't think it matters, but perhaps it is cheaper/simpler to widen and then deepen, since deepening appears to be cheaper. I think it would also mean that noise won't be applied twice to the new widened filters in the new blocks. Does that make sense?

@Ttl
Member Author

Ttl commented Jan 22, 2018

It was easier to do it this way, since widening also modifies the next layer's inputs. New channels in the new blocks will have more noise, but I'm not worried about that.

@bood
Collaborator

bood commented Jan 22, 2018 via email

@Ttl
Member Author

Ttl commented Jan 22, 2018

The modified network should have identical outputs to the old one. It's meant to be used as initialization for training. Training should converge much quicker when the output is already almost correct, compared to starting from a completely random network.

@bood
Collaborator

bood commented Jan 22, 2018 via email

@Ttl
Member Author

Ttl commented Jan 22, 2018

Training of the 6 block network was started from random initialization. There aren't other alternatives right now. Don't confuse random weight initialization with self-play by a random network.

This doesn't solve whatever problem is causing the value network to be overfit. Network initialized with this method still needs to be trained and it would end up being overfit in the same way.

@bood
Collaborator

bood commented Jan 22, 2018 via email

@roy7
Collaborator

roy7 commented Jan 22, 2018

I get @bood's point and I assume he's correct. If we extend the best 5x64 into a 6x128 in a way that gives the same results for all inputs, it's simply larger; it should basically just keep going from where the 5x64 currently is, without the artifacts a supervised-trained 6x128 has picked up.

If nothing else, if we can extend the network successfully, training could be tried with both 6x128 networks to see which can promote to a better new network sooner. And it'd also be a success to build on later for 7x128 or 6x256 or whatever once the technique is proven.

@isty2e

isty2e commented Jan 22, 2018

My opinion is that it is unclear whether there is any overfitting in the current 6-block networks, since there is no evidence to support it, and if there is any, it can be cured by self-play reinforcement learning. At any rate, this provides a safe and fast path from a small net to a larger one, so the bootstrapping is much easier and more reliable. Of course, the first training cycle after bootstrapping will probably take more time than a regular cycle.

@jkiliani

Has any play testing been done yet on how the enlarged net does against the base version when they're matched against each other?

@bood
Collaborator

bood commented Jan 22, 2018 via email

@isty2e

isty2e commented Jan 22, 2018

Yes, due to the issue @bood mentioned, it is not complete yet. Adding blocks doesn't matter, but upon adding filters the heatmap and eval are somehow distorted, so it is not yet time for actual match tests. Like I said, I could not find out why that happens, so if anyone is interested, please take a look and figure it out.

@bood
Collaborator

bood commented Jan 22, 2018

> My opinion is that it is unclear whether there is any overfitting in the current 6-block networks, since there is no evidence to support it, and if there is any, it can be cured by self-play reinforcement learning. At any rate, this provides a safe and fast path from a small net to a larger one, so the bootstrapping is much easier and more reliable. Of course, the first training cycle after bootstrapping will probably take more time than a regular cycle.

Yes, maybe it's not due to overfitting; I don't know what it should be called, but it seems that a supervised-trained 6b loses some information. E.g. c83 knows quite a bit about ladders, but ed0/7fd seem to be weak at ladders, since c83 plays few ladders and so ed0/7fd had nothing to learn from. Just like the weights.txt we get from human games.

On the other hand, if we have a 6b identical to the 5b, we don't even have a bootstrapping problem! We can continue to include the self-play games generated by c83 in the training window, since in theory this 6b would generate the same games as c83.

@isty2e

isty2e commented Jan 22, 2018

@bood Due to the difference in the training sets, the skills they have learned should be different, so it will eventually be sorted out once we are on track again. And even using this net2net method, a larger net is in practice expected to behave differently, since we have to add random noise to break the symmetry in the weights, to avoid some neurons being exactly identical. Thus there should be some training before replacing the best net.

Still, this can save a lot of time: when we start with a larger net, we do not have to start from a random network and train it with all the games played historically, but can instead start with a net2net version and train it with a given training window.

@Mardak
Collaborator

Mardak commented Jan 22, 2018

Does it make sense to try training a deeper net first? E.g., 5x64 -> 10x64?

$ ./net2net.py 5 0 c83e1b6e0ffbf8e684f2d8f6261853f14c553b29ee0e70ff6c34e87d28009c43
Version 1
Channels 64
Blocks 5
Processing block 1
Processing block 2
Processing block 3
Processing block 4
Processing block 5
Processing block 6
Processing block 7
Processing block 8
Processing block 9
Processing block 10

$ ./leelaz -w c83e1b6e0ffbf8e684f2d8f6261853f14c553b29ee0e70ff6c34e87d28009c43_net2net
…
Detecting residual layers...v1...64 channels...10 blocks.
…
Leela: heatmap
  0   0   1   1   1   1   1   0   0   0   0   0   0   0   0   0   1   1   1 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0 
  0   1   3   5   1   0   1   1   1   1   1   1   1   0   1   7   3   1   0 
  0   1   5 144   3   1   1   1   1   1   1   1   1   0   2 133   6   1   0 
  0   1   1   3   1   1   1   1   1   1   1   1   1   1   1   3   1   1   0 
  0   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   1   0   0   1   1   1   1   1   1   1   1   1   1   1   0   0   1   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   0 
  0   1   5 134   2   0   1   1   1   1   1   1   1   0   2 127   5   1   0 
  0   1   3   6   1   0   0   1   1   1   1   1   1   0   1   5   3   1   0 
  0   1   1   1   1   1   1   0   1   0   1   0   1   1   1   1   1   1   0 
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
pass: 0
winrate: 0.440024

@Ttl
Member Author

Ttl commented Jan 22, 2018

The non-matching output is caused by the residual connection. The first convolution has some channels duplicated, and the next layer's inputs for the duplicated channels are divided by the duplication factor, which makes the output the same as before. The residual connection, however, messes this up, since it doesn't include the division by the duplication factor.

To confirm it, I commented out the residual connection in the leelaz source code, and the output matches.

However, I'm not sure what the best way to fix it would be. This might be the closest it gets.
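
For intuition, here is a toy numeric check of the conv-path invariant described above, with the residual connection left out (as in the leelaz experiment). It uses 1x1 convolutions at a single spatial point, so convolutions reduce to matrix products; all names are illustrative, not from net2net.py:

import numpy as np

rng = np.random.default_rng(0)
C, Cw = 3, 5                       # original and widened channel counts
x = rng.normal(size=C)
W1 = rng.normal(size=(C, C))       # first conv:  C -> C channels
W2 = rng.normal(size=(C, C))       # second conv: C -> C channels

g = np.concatenate([np.arange(C), rng.integers(0, C, Cw - C)])  # channel mapping
counts = np.bincount(g, minlength=C)                            # duplication factors

W1w = W1[g, :]                     # duplicate the first conv's output channels
W2w = W2[:, g] / counts[g]         # divide the next layer's inputs by the factor

# Without a residual connection the widened conv path reproduces the output:
assert np.allclose(W2 @ (W1 @ x), W2w @ (W1w @ x))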

@roy7
Collaborator

roy7 commented Jan 23, 2018

Yes, no worries. Thanks for your feedback. This is just a one-time thing to assure myself net2net was working as expected.

@roy7
Collaborator

roy7 commented Jan 23, 2018

Final results of the -p 50 5x64 vs. 6x128 net2net match when I stopped it: 265 wins, 246 losses.

@alreadydone
Contributor

alreadydone commented Jan 24, 2018

Forgive my obsession with math, but I came up with another randomization. I suspect that the existing randomizations already make the enlarged networks trainable, though, even without noise (all weights should already be distinct, though I don't know whether this suffices), and we probably don't need to widen anymore unless we transition from 128 filters to 196 or 256.

The idea is that w(j) can be chosen independently for each output channel h and each spot in the 3x3 convolutional kernel. I didn't point out earlier that w(j) can also be allowed to depend on h, though this is a trivial observation and maybe already implemented? Even negative values of w(j) can be allowed, though I don't know whether it's advantageous to keep them close to 1/c_j so as not to affect the variance of the weights.

Since the layers are 3x3 convolutional, the q inputs are effectively 9q inputs, and we label the r-th entry of the j-th input channel by (j, r), r = 1, 2, ..., 9. Then for any function w(j, r, h) satisfying

    sum over x with g(x) = g(j) of w(x, r, h) = 1,  for all j, r, h,

we can let

    U^(i+1)_{(j,r),h} = w(j, r, h) * W^(i+1)_{(g(j),r),h}

(cf. page 4 of the paper).

(Just to clarify, the skip connection does not need to be divided by anything. The inputs/outputs of each layer of the enlarged network are supposed to be the same as the original network, only replicated into (different numbers of) copies.)
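
A quick numeric check of this condition (a toy sketch, with the kernel-spot index r dropped for brevity; all names are illustrative): any weights w(j, h) that sum to 1 over each replication class {x : g(x) = g(j)} reproduce the original layer's output, not just the uniform 1/count choice.

import numpy as np

rng = np.random.default_rng(1)
C, Cw, H = 3, 5, 4                     # original channels, widened channels, outputs
g = np.concatenate([np.arange(C), rng.integers(0, C, Cw - C)])
h = rng.normal(size=C)                 # original hidden activations
h_wide = h[g]                          # replicated activations
W = rng.normal(size=(H, C))            # original next-layer weights

w = rng.random(size=(H, Cw))           # arbitrary positive weights, then normalize
for c in range(C):
    cls = (g == c)
    w[:, cls] /= w[:, cls].sum(axis=1, keepdims=True)  # each class sums to 1

U = w * W[:, g]                        # U_{(j),h} = w(j, h) * W_{g(j),h}
assert np.allclose(U @ h_wide, W @ h)  # widened layer output matches the original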

@alreadydone
Contributor

What I said above is not completely correct; it only works if we are widening but not deepening. If we are only deepening but not widening, it seems inevitable to add noise to get rid of the completely zero weights in the added block. But if we deepen and then widen, then the randomizations above apply, and we can expect a trainable enlarged network without noise (hence with identical output). The formula needs to be modified, though.

First of all, to ensure that we get rid of all the zeros, g(j) must hit each number in 1, 2, ..., 64 at least twice, so we must use g(j) = j for 1 ≤ j ≤ 64 and g(j) = j - 64 for 65 ≤ j ≤ 128 (or something equivalent), not only for the inputs/outputs of the blocks but also for the intermediate layer between the two convolutions. To ensure that all weights are distinct, we pick distinct numbers w(j, r, h) for 1 ≤ j ≤ 64, 1 ≤ r ≤ 9 and 1 ≤ h ≤ (number of output channels at that step). We then simply take

    U^(i+1)_{(j,r),h}    =  w(j, r, h)
    U^(i+1)_{(j+64,r),h} = -w(j, r, h)

These should ensure that the added layer(s) have distinct weights.
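
A toy check of the cancellation this construction relies on (a sketch, under the assumption that channels j and j + 64 carry identical replicated values): pairing each weight with its negation makes the new conv's output exactly zero, so the skip connection alone passes the signal through unchanged while every weight stays distinct.

import numpy as np

rng = np.random.default_rng(2)
C = 64
h = rng.normal(size=C)
h_wide = np.concatenate([h, h])        # replicated channels: g(j) = j mod 64
w = rng.normal(size=(C, C))            # distinct weights for channels 1..64
U = np.concatenate([w, -w], axis=1)    # channels 65..128 get the negated weights
assert np.allclose(U @ h_wide, 0)      # new layer outputs zero; identity via skip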

(If noise is to be added, I think the new block should be added at the bottom (closest to the final output layer) to minimize accumulated errors.)

@gcp gcp merged commit 6f93c0e into leela-zero:next Jan 24, 2018
@Ttl
Member Author

Ttl commented Jan 24, 2018

@alreadydone Interesting, but I think the current implementation is good enough as is. I don't think I'm going to implement it, but if you want you can code it yourself and submit a pull request.

earthengine pushed a commit to earthengine/leela-zero that referenced this pull request Jan 25, 2018
@alreadydone
Contributor

After the 6x128 network stalls and we use net2net to migrate to 10x128, I wonder whether it would be a good idea to freeze the old blocks during training. This seems to be a common technique in transfer learning and a good way to make sure the new network doesn't forget too much when subjected to training, even with a high learning rate; the features extracted by the 6-block network should be useful for extracting further features anyway. (I would add the 4 new blocks near the output and only allow the weights in these blocks and in the policy/value heads to vary.) This also helps reduce the training time. When the 10x128 stalls, we could unfreeze to squeeze out the last bit of potential.

I wonder whether this strategy would work for transferring knowledge between different komi values and board sizes. It seems we could just copy all the weights in the residual blocks, freeze them (except maybe the last few blocks), and only train the fully connected layers from random initialization; the quality of the self-play games should improve faster and more steadily than if the blocks were not frozen. If anyone tries this, please let me know the result.

Reference: https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/ Section "Ways to Fine tune the model"
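
If anyone wants to experiment, a minimal sketch of the freezing idea in 2018-era TensorFlow (this is not leela-zero's actual tfprocess code; the scope names and the loss tensor are hypothetical): only the new blocks' and heads' variables are handed to the optimizer, so the copied blocks receive no updates.

import tensorflow as tf

# Assume `loss` is the combined policy/value loss and variables are scoped
# per block, e.g. "res_7/...", "policy_head/..." (hypothetical names).
new_scopes = ('res_7', 'res_8', 'res_9', 'res_10', 'policy_head', 'value_head')
trainable = [v for v in tf.trainable_variables()
             if v.name.startswith(new_scopes)]

opt = tf.train.MomentumOptimizer(learning_rate=0.005, momentum=0.9)
train_op = opt.minimize(loss, var_list=trainable)  # frozen blocks get no gradients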

@Ttl
Member Author

Ttl commented Feb 20, 2018

I tried training an 8x128 network using net2net before, and it seemed to work just fine without freezing weights. After 351k steps at batch size 128, I got the following result in validation:

101 wins, 75 losses
176 games played.
Status: 0 LLR 1.73266 Lower Bound -2.94444 Upper Bound 2.94444

Not statistically significant yet, but it seems to be doing well, considering this was the first network I tested, and normally finding a better network seems to require testing many networks.

The previous issues with bootstrapping were probably caused by the shuffling bug.

@bood
Collaborator

bood commented Feb 20, 2018

@Ttl Are you using the 5x64 network c83's self-plays to do the training? What learning rate are you using? The net2net paper seems to suggest using a 1/10 learning rate for the newly generated network?

I agree that we should do some testing on net2net before 6x128 network stalls.

@Ttl
Member Author

Ttl commented Feb 20, 2018

I started with the 63498 network. The learning rate was 0.005 for 150k steps, and after that I dropped it to 0.0005.

@bood
Collaborator

bood commented Feb 20, 2018

I see. But for batch size 128, shouldn't it be using 0.0025 and 0.00025 to match the current setting of 512/0.005? And how did you decide on 150k steps for the high learning rate run?

@Ttl
Member Author

Ttl commented Feb 20, 2018

Yes, if you want to match the current parameters.

@bood
Collaborator

bood commented Feb 22, 2018

Also tried net2net from 5x64 to 6x128, trained with c83 self-plays.

Parameters:
Batch size: 128; learning rate: 0.0025 for the first 152k steps, 0.00025 for the rest

Current results (guess I should reverse the network order):
./validation -g 1 -n ~/5b/c83e1b6e0ffbf8e684f2d8f6261853f14c553b29ee0e70ff6c34e87d28009c43 -n /root/6b/leelaz-model-352000.txt -k 352k | tee 352k.log
312k.log:16 wins, 38 losses
352k.log:2 wins, 14 losses

Looks promising too.

@alreadydone
Contributor

alreadydone commented Feb 22, 2018

I read about CNN backpropagation, and it seems that freezing 6 blocks of weights can indeed cut training time by 60%: there are roughly the same number of weights in a fc layer as in a convolutional layer (361x2x362 in the policy head vs. 128x128x9), but the amount of computation in the convolutional layers gets multiplied by the 361 board positions (19x19), so it dominates. Hope I didn't get these wrong... Training may not be the bottleneck, though.
(Edit: I overlooked that forward propagation still needs to be done in full, so maybe a 30% reduction in training time is a better estimate.)
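
For reference, the rough arithmetic behind this comparison (a sketch using the numbers above; the per-position multiplier is my reading of the estimate):

fc_params   = 361 * 2 * 362        # = 261,364 weights in the policy fc layer
conv_params = 128 * 128 * 9        # = 147,456 weights per 3x3 conv layer
conv_madds  = conv_params * 361    # each conv weight is applied at every board point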

@bood
Collaborator

bood commented Feb 26, 2018

More match results for the 6x128 leelaz-model-352000.txt I trained:

47 wins, 12 losses (data from the last post included)

Also, I tested some ladder scenarios with the tool mentioned in #905. It shows this 6x128 performs better than ed0, which was trained by SL, though still worse than c83 (leelaz v0.12, 1600 playouts):

c83    ed0    352k
51/70  13/70  23/70

@alreadydone
Contributor

alreadydone commented Feb 28, 2018

Is GPU memory what is limiting us from using the same batch size (2048) as in the AGZ paper? If so, this work, which trades memory usage for time, may be useful. The implementation is in TensorFlow as well (it modifies the computation graph). A graph near the bottom shows the effect for up to 20 blocks at batch size 1280.
Eventually we'll switch to larger networks and self-play games will be generated more slowly, so a slight increase in training time seems acceptable (a larger batch size may even be faster?). Shuffling may need to be further optimized to work with a larger batch size, though.

@alreadydone
Contributor

alreadydone commented Mar 28, 2018

@Ttl I guess net2net doesn't currently support adding input planes. Would you mind adding this functionality? I am considering adding a liberty plane to pass the NN the number of liberties of the group containing each stone, and this also seems to be Minigo's plan. It helps bridge the gap between the local nature of convolutional filters and the global/long-range aspects of Go. Combined with tree search and 3x3 filters, this will allow the network to deal with (self-)atari, illegal moves, life-and-death, and ladder problems more efficiently (using less of its capacity, which is especially crucial for a small-sized network). From what I hear, people expect LZ to easily beat top pros and rise to the tier of DeepZenGo and Fine Art once these problems are fixed. I think there's enough evidence that it's much more efficient to count liberties with an optimized algorithm than to force the network to grasp it. Hopefully introducing this plane will boost the training accuracy and then the strength of the network through self-play.

After introducing the liberty plane (a toy sketch of computing such a plane follows the list):

  • illegal moves: a move is illegal when all adjacent same-color groups have 1 liberty and all adjacent opposite-color groups have more than 1 liberty (if some opposing group has only 1, the move is legal).
  • (self-)atari: the net won't have trouble seeing that a large group has only one liberty.
  • ladders: towards the end of the tree search, it won't have a problem seeing that a huge group has one liberty (although it may not know exactly how large it is) and treating the atari/escaping moves as the most urgent.
  • life-and-death: better at semeai (capturing races); won't have a problem capturing a dead group in a no-resign game with an "illegal move".
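
A toy sketch of computing such a per-stone liberty plane via flood fill (illustrative Python, not leela-zero code):

import numpy as np

def liberty_plane(board):
    """board: 2D int array; 0 = empty, 1 = black, 2 = white.
    Returns a plane with each stone's group liberty count (0 on empty points)."""
    n, m = board.shape
    plane = np.zeros((n, m))
    seen = np.zeros((n, m), dtype=bool)
    for i in range(n):
        for j in range(m):
            if board[i, j] == 0 or seen[i, j]:
                continue
            # Flood-fill the group containing (i, j), collecting its liberties.
            color, stack, group, libs = board[i, j], [(i, j)], [], set()
            seen[i, j] = True
            while stack:
                y, x = stack.pop()
                group.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < n and 0 <= nx < m):
                        continue
                    if board[ny, nx] == 0:
                        libs.add((ny, nx))        # shared liberties counted once
                    elif board[ny, nx] == color and not seen[ny, nx]:
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            for y, x in group:
                plane[y, x] = len(libs)
    return plane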

@gcp Having become, by everyone's account, the strongest engine, Leela Zero is nowadays not just pure (reproducing) scientific research but also a widely used tool for analyzing and learning Go**; as you said, we are pushing the limit. What will benefit the most people is a network that is simultaneously strong and efficient (in computing resources): friendlier to the environment, to weaker machines, and to mobile devices. Some people may still want to see the limit a Zero bot can eventually reach with a full-sized 20x256 or 40x256 network (expected to read all meaningful ladders correctly with ? playouts). However, AFAIK there's currently no plan to switch to 20x256 because it's too slow, and even the strong 20x256 network 93229e still shows ladder problems. I do have some new ideas to help bring in more computing power, though it will take some time and momentum to get them running.

If most people stick to the Zero approach and don't agree to use a liberty-enhanced network for self-play, being able to add the plane through net2net at least allows interested people to train such a network themselves. Since the new input plane is not included in the current training data, the liberties had better be calculated during training; hopefully some additions to the training code will also allow people to experiment with other input features.

Footnote: I don't see any point in using separate planes for 1, 2, 3, ... liberties as in the first AlphaGo paper; liberties are additive when you connect two groups (for this to work better, put the number of empty adjacent points in the input plane for an empty point), and AlphaZero also used only one plane for move count.

** A post showing Go schools in Shanghai have adopted AI and mentioning using LZ to review games.

@gcp
Member

gcp commented Mar 28, 2018

You know you don't need net2net support for that, right? In fact, using net2net for that seems extremely bad: quite a bit of the existing net is going to be trained to count liberties, which you're now making redundant. What would be the point of retaining that training?

You should just train a net from randomly initialized weights for that.

I have no interest in doing augmented approaches here. For sure they are interesting and may produce stronger bots. The code is open; you can fork it and try whatever you want.

@bochen2027

@alreadydone At first I thought your cryptocurrency idea was way out there, but now I think it could work; I posted some ideas of my own that somewhat overlap with yours. The idea is to bake crypto/blockchain into the Go GUI itself and integrate Sabaki with LZ.

SabakiHQ/Sabaki#358

@alreadydone
Contributor

alreadydone commented Mar 30, 2018

> quite a bit of the existing net is going to be trained to count liberties, which you're now making redundant.

Interesting point. I think net2net still serves as a smart initialization, though: the network will forget unnecessary, incomplete knowledge about liberties and retain what remains crucial after introducing the plane (e.g. using liberties to make decisions); at least net2net would be the first thing I'd try. It seems some people prefer redundancy and even introduce dropout layers, which we don't have...

I haven't looked into the training code yet, but for inference I think gather_features is the correct place to insert the plane. count_rliberties has been removed from FastBoard, and it looks like m_libs is counting pseudo-liberties? I hope the count_rliberties found in https://github.com/gcp/leela-zero/blob/5fb3b4d1d7b03b69e71e0729016078e4b3d32cd1/src/FastBoard.cpp still works; could it be made more efficient if we want to count the liberties of every stone/group? (Related: #173)

@lightvector

lightvector commented Mar 30, 2018

@alreadydone

> Footnote: I don't see any point in using separate planes for 1, 2, 3, ... liberties as in the first AlphaGo paper; liberties are additive when you connect two groups (for this to work better, put the number of empty adjacent points in the input plane for an empty point), and AlphaZero also used only one plane for move count.

Yeah, I also expect a neural net would be able to make decent use of a single real-valued liberty plane. But it's also not crazy to encode them as separate planes. Encoding it additively in a single plane is sort of like a weak prior that the relevant things the neural net wants to compute vary smoothly with that value, even monotonically or linearly; encoding it in separate planes is sort of like a weak prior that each value behaves distinctly. Either way, the neural net can learn from data to override that prior. I'm not actually sure which one would perform better. You should try it and compare!
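
A toy sketch of the two encodings being compared (illustrative only; the normalization constant and the >=4 cutoff are assumptions, not anyone's tested setup):

import numpy as np

libs = np.array([[1, 2, 0],
                 [3, 0, 5],
                 [0, 4, 2]], dtype=float)   # per-stone liberty counts (0 = empty)

single = libs / 8.0                          # one real-valued plane (assumed scaling)
onehot = np.stack([libs == 1, libs == 2,
                   libs == 3, libs >= 4]).astype(np.float32)  # one plane per count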

By the way, my experience is that for the policy part of the net at least, only atari, 2 liberties, and 3 liberties are important for the neural net to understand; after that you get heavily diminishing returns. For example, when I trained a neural net of about 5 blocks with a typical set of non-"zero" features (liberties, ladder status, etc.), I experimentally tried removing single input planes without retraining the neural net, to see how much unexpectedly losing each input plane alone harmed the prediction accuracy for pro moves.

The baseline test-set accuracy was 52.18%. Losing "opponent stone" and "own stone" massively dropped the accuracy to 5.16% and 4.89%, respectively; obviously these are critical features. Losing "own group has 1 liberty" and "enemy group has 1 liberty" dropped it to 46.93% and 48.19%. Losing "own group has 2 liberties" and "enemy group has 2 liberties" dropped it to 50.16% and 50.10%. For 3 liberties, it was 51.45% and 51.40%. And for 4 liberties, it was 52.01% and 52.23% - so by the time you get to the 4-liberties feature, the net was hardly using it at all and barely noticed whether you included it or not.

@Ttl
Member Author

Ttl commented Mar 30, 2018

I'll see if I can add the functionality for increasing the number of input planes. I'm not sure whether it makes sense to use it or not, but I don't think it's that hard to implement.

zhiyue pushed a commit to awesome-archive/leela-zero that referenced this pull request Apr 14, 2018