
net2net #704

Merged
merged 9 commits into leela-zero:next on Jan 24, 2018
Conversation

Member

@Ttl Ttl commented Jan 21, 2018

Implementation of net2net (https://arxiv.org/abs/1511.05641) for leelaz network.

Adding blocks works, but widening still has some bugs and the output of the widened network isn't exactly equal to the old one's. It's still pretty close, and even now it should be a better initialization than a random one.

Just putting this here in case anyone else was working on this. I'll investigate the widening later.

Example usage:

python net2net.py 1 64 <network> adds 1 block and 64 filters to the network. Output is written to <network>_net2net.

barrybecker4 and others added 2 commits January 19, 2018 22:00
Widening has still some bugs and the output doesn't match.
Adding more blocks works.
return weights, next_weights

rand = range(channels)
rand.extend(np.random.randint(0, channels, new_channels))

In Python 3, ranges cannot be directly extended:

AttributeError: 'range' object has no attribute 'extend'

After changing rand = range(channels) to rand = list(range(channels)), it works fine for me.
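
For reference, a minimal standalone sketch of the fix (the sizes are example values, not from net2net.py):

import numpy as np

channels, new_channels = 64, 64          # example sizes only
# list() is needed in Python 3, where range() returns an immutable range object
rand = list(range(channels))
rand.extend(np.random.randint(0, channels, new_channels))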

@isty2e

isty2e commented Jan 21, 2018

I tested with the latest best net, adding a block and 128 channels, and it does not run into any verification problem. By "the output of the widened network isn't exactly equal to the old one", do you mean a deviation under 1e-6?

@Ttl
Member Author

Ttl commented Jan 21, 2018

Verification in the program works. However, if you try the widened net in leelaz, it doesn't give the same heatmap.

@isty2e

isty2e commented Jan 21, 2018

Do you have even the slightest clue what the problem is here? Having read the code, I still don't see why the heatmap and NN evals are different.

@Ttl
Member Author

Ttl commented Jan 21, 2018

As you can see, the included verification code passes. I have no idea why the heatmaps are different.

The verification is calculated without batch norms and residual connections, so maybe it's something related to those? Although I believe I'm processing the batch norm weights correctly, and the residual connection shouldn't affect the widening. Or maybe there's a bug in the text processing somewhere?

@isty2e

isty2e commented Jan 21, 2018

As far as I have tested, text processing and I/O seem okay; I tested this by adding 0 blocks and 0 channels and removing the if new_channels == 0 check in conv_bn_wider(). Also, after adding blocks and filters, the first values of the weights were appropriately divided by rep_factor, and I don't think there is anything wrong with the batch norm layer either. Could there be something wrong with the last layers?


for i in range(len(rand)):
    noise = 0
    if i > channels:
Member

@Hersmunch Hersmunch Jan 21, 2018

Just reading through to learn/try to understand. I might be wrong, but should this be >=?

Member Author

Good catch. It should be >=.
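
A minimal sketch of the corrected loop (noise_std and its magnitude are hypothetical placeholders, not values from net2net.py):

import numpy as np

channels = 64
rand = list(range(channels))
rand.extend(np.random.randint(0, channels, 64))   # example: widen by 64 channels

noise_std = 5e-3                                  # hypothetical magnitude
for i in range(len(rand)):
    noise = 0
    if i >= channels:  # only the duplicated channels get symmetry-breaking noise
        noise = np.random.normal(0, noise_std)
    # (in the real script the noise is then added to the duplicated filter's weights)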

@Hersmunch
Member

Hersmunch commented Jan 21, 2018

Does the order of deepening and widening matter? Has the order of deepen-then-widen been chosen for a particular reason?
From what I read I don't think it matters, but perhaps it is cheaper/simpler to widen and then deepen, since deepening appears to be cheaper. I think it would also mean that noise won't be applied twice to the new widened filters in the new blocks. Does that make sense?

@Ttl
Member Author

Ttl commented Jan 22, 2018

It was easier to do it this way, since widening also modifies the next layer's inputs. New channels in the new blocks will have more noise, but I'm not worried about that.

@bood
Collaborator

bood commented Jan 22, 2018 via email

@Ttl
Member Author

Ttl commented Jan 22, 2018

The modified network should have identical outputs to the old one. It's meant to be used as initialization for training. Training should converge much quicker when the output is already almost correct, compared to starting from a completely random network.

@bood
Collaborator

bood commented Jan 22, 2018 via email

@Ttl
Member Author

Ttl commented Jan 22, 2018

Training of the 6 block network was started from random initialization. There aren't other alternatives right now. Don't confuse random weight initialization with self-play by a random network.

This doesn't solve whatever problem is causing the value network to be overfit. Network initialized with this method still needs to be trained and it would end up being overfit in the same way.

@bood
Collaborator

bood commented Jan 22, 2018 via email

@roy7
Collaborator

roy7 commented Jan 22, 2018

I get @bood's point and I assume he's correct. If we extend the best 5x64 into a 6x128 in a way that gives the same results for all inputs, it's simply larger; it should basically just keep going from where the 5x64 currently is, without the artifacts a supervised-trained 6x128 has picked up.

If nothing else, if we can extend the network successfully, training could be tried with both 6x128 networks to see which can promote to a better new network sooner. And it'd also be a success to build on later for 7x128 or 6x256 or whatever once the technique is proven.

@isty2e

isty2e commented Jan 22, 2018

My opinion is that it is unclear whether there is any overfitting in the current 6-block networks, since there is no evidence to support it, and if there is any, it can be cured by self-play reinforcement learning. At any rate, this provides a safe and fast path from a small net to a larger one, so the bootstrapping is much easier and more reliable. Of course, the first training cycle after bootstrapping will probably take more time than a regular cycle.

@jkiliani

Has any play testing been done yet on how the enlarged net does against the base version when they're matched against each other?

@bood
Collaborator

bood commented Jan 22, 2018 via email

@isty2e

isty2e commented Jan 22, 2018

Yes, due to the issue @bood mentioned, it is not complete yet. Adding blocks doesn't matter, but upon adding filters the heatmap and eval are somehow distorted, so it is not yet time for actual match tests. Like I said, I could not find out why that happens, so if anyone is interested, please take a look and figure it out.

@bood
Collaborator

bood commented Jan 22, 2018

> My opinion is that it is unclear whether there is any overfitting in the current 6-block networks, since there is no evidence to support it, and if there is any, it can be cured by self-play reinforcement learning. At any rate, this provides a safe and fast path from a small net to a larger one, so the bootstrapping is much easier and more reliable. Of course, the first training cycle after bootstrapping will probably take more time than a regular cycle.

Yes, maybe it's not due to overfitting; I don't know what it should be called, but it seems that a supervised-trained 6b loses some information. E.g. c83 knows quite a bit about ladders, but ed0/7fd seem to be weak at ladders, since c83 plays few ladders and so ed0/7fd had nothing to learn from. Just like the weights.txt we get from human games.

On the other hand, if we have a 6b identical to the 5b, we don't even have a bootstrapping problem! We can continue to include the self-play games generated by c83 in the training window, since in theory this 6b would generate the same games as c83.

@isty2e

isty2e commented Jan 22, 2018

@bood Due to the difference in the training sets, the skills they have learned should be different, so it will eventually be sorted out once we are on track again. And even using this net2net method, a larger net is in practice expected to behave differently, since we have to add random noise to break the symmetry in the weights, to avoid some neurons being exactly identical. Thus there should be some training before replacing the best net.

Still, this can save a lot of time: when we start with a larger net, we do not have to start from a random network and train it with all the games played historically, but can instead start with a net2net version and train it with a given training window.

@Mardak
Collaborator

Mardak commented Jan 22, 2018

Does it make sense to try training a deeper net first? E.g., 5x64 -> 10x64?

$ ./net2net.py 5 0 c83e1b6e0ffbf8e684f2d8f6261853f14c553b29ee0e70ff6c34e87d28009c43
Version 1
Channels 64
Blocks 5
Processing block 1
Processing block 2
Processing block 3
Processing block 4
Processing block 5
Processing block 6
Processing block 7
Processing block 8
Processing block 9
Processing block 10

$ ./leelaz -w c83e1b6e0ffbf8e684f2d8f6261853f14c553b29ee0e70ff6c34e87d28009c43_net2net
…
Detecting residual layers...v1...64 channels...10 blocks.
…
Leela: heatmap
  0   0   1   1   1   1   1   0   0   0   0   0   0   0   0   0   1   1   1 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0 
  0   1   3   5   1   0   1   1   1   1   1   1   1   0   1   7   3   1   0 
  0   1   5 144   3   1   1   1   1   1   1   1   1   0   2 133   6   1   0 
  0   1   1   3   1   1   1   1   1   1   1   1   1   1   1   3   1   1   0 
  0   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0 
  0   1   0   0   1   1   1   1   1   1   1   1   1   1   1   0   0   1   0 
  0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   0 
  0   1   5 134   2   0   1   1   1   1   1   1   1   0   2 127   5   1   0 
  0   1   3   6   1   0   0   1   1   1   1   1   1   0   1   5   3   1   0 
  0   1   1   1   1   1   1   0   1   0   1   0   1   1   1   1   1   1   0 
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
pass: 0
winrate: 0.440024

@Ttl
Member Author

Ttl commented Jan 22, 2018

The non-matching output is caused by the residual connection. The first convolution has some channels duplicated, and the next layer's inputs for the duplicated channels are divided by the duplication factor, which makes the output the same as before. The residual connection, however, messes this up, since it doesn't include the division by the duplication factor.

To confirm it, I commented out the residual connection in the leelaz source code, and the output matches.

However, I'm not sure what the best way to fix it would be. This might be the closest it gets.
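
For intuition, here is a toy numeric check of the conv-path invariant described above, with the residual connection left out (as in the leelaz experiment). It uses 1x1 convolutions at a single spatial point, so convolutions reduce to matrix products; all names are illustrative, not from net2net.py:

import numpy as np

rng = np.random.default_rng(0)
C, Cw = 3, 5                       # original and widened channel counts
x = rng.normal(size=C)
W1 = rng.normal(size=(C, C))       # first conv:  C -> C channels
W2 = rng.normal(size=(C, C))       # second conv: C -> C channels

g = np.concatenate([np.arange(C), rng.integers(0, C, Cw - C)])  # channel mapping
counts = np.bincount(g, minlength=C)                            # duplication factors

W1w = W1[g, :]                     # duplicate the first conv's output channels
W2w = W2[:, g] / counts[g]         # divide the next layer's inputs by the factor

# Without a residual connection the widened conv path reproduces the output:
assert np.allclose(W2 @ (W1 @ x), W2w @ (W1w @ x))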

@roy7
Collaborator

roy7 commented Jan 23, 2018

Yes, no worries. Thanks for your feedback. This is just a one-time thing to assure myself net2net was working as expected.

@roy7
Collaborator

roy7 commented Jan 23, 2018

Final results of the -p 50 5x64 vs. 6x128 net2net match when I stopped it: 265 wins, 246 losses.

@alreadydone
Contributor

alreadydone commented Jan 24, 2018

Forgive my obsession with math, but I came up with another randomization. I suspect that the existing randomizations already make the enlarged networks trainable, though, even without noise (all weights should already be distinct, though I don't know whether this suffices), and we probably don't need to widen anymore unless we transition from 128 filters to 196 or 256.

The idea is that w(j) can be chosen independently for each output channel h and each spot in the 3x3 convolutional kernel. I didn't point out earlier that w(j) can also be allowed to depend on h, though this is a trivial observation and maybe already implemented? Even negative values of w(j) can be allowed, though I don't know whether it's advantageous to keep them close to 1/c_j so as not to affect the variance of the weights.

Since the layers are 3x3 convolutional, the q inputs are effectively 9q inputs, and we label the r-th entry of the j-th input channel by (j, r), r = 1, 2, ..., 9. Then for any function w(j, r, h) satisfying

    sum over x with g(x) = g(j) of w(x, r, h) = 1,  for all j, r, h,

we can let

    U^(i+1)_{(j,r),h} = w(j, r, h) * W^(i+1)_{(g(j),r),h}

(cf. page 4 of the paper).

(Just to clarify, the skip connection does not need to be divided by anything. The inputs/outputs of each layer of the enlarged network are supposed to be the same as the original network, only replicated into (different numbers of) copies.)
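
A quick numeric check of this condition (a toy sketch, with the kernel-spot index r dropped for brevity; all names are illustrative): any weights w(j, h) that sum to 1 over each replication class {x : g(x) = g(j)} reproduce the original layer's output, not just the uniform 1/count choice.

import numpy as np

rng = np.random.default_rng(1)
C, Cw, H = 3, 5, 4                     # original channels, widened channels, outputs
g = np.concatenate([np.arange(C), rng.integers(0, C, Cw - C)])
h = rng.normal(size=C)                 # original hidden activations
h_wide = h[g]                          # replicated activations
W = rng.normal(size=(H, C))            # original next-layer weights

w = rng.random(size=(H, Cw))           # arbitrary positive weights, then normalize
for c in range(C):
    cls = (g == c)
    w[:, cls] /= w[:, cls].sum(axis=1, keepdims=True)  # each class sums to 1

U = w * W[:, g]                        # U_{(j),h} = w(j, h) * W_{g(j),h}
assert np.allclose(U @ h_wide, W @ h)  # widened layer output matches the original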

@alreadydone
Contributor

What I said above is not completely correct; it only works if we are widening but not deepening. If we are only deepening but not widening, it seems inevitable to add noise to get rid of the completely zero weights in the added block. But if we deepen and then widen, then the randomizations above apply, and we can expect a trainable enlarged network without noise (hence with identical output). The formula needs to be modified, though.

First of all, to ensure that we get rid of all the zeros, g(j) must hit each number in 1, 2, ..., 64 at least twice, so we must use g(j) = j for 1 ≤ j ≤ 64 and g(j) = j - 64 for 65 ≤ j ≤ 128 (or something equivalent), not only for the inputs/outputs of the blocks but also for the intermediate layer between the two convolutions. To ensure that all weights are distinct, we pick distinct numbers w(j, r, h) for 1 ≤ j ≤ 64, 1 ≤ r ≤ 9 and 1 ≤ h ≤ (number of output channels at that step). We then simply take

    U^(i+1)_{(j,r),h}    =  w(j, r, h)
    U^(i+1)_{(j+64,r),h} = -w(j, r, h)

These should ensure that the added layer(s) have distinct weights.
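
A toy check of the cancellation this construction relies on (a sketch, under the assumption that channels j and j + 64 carry identical replicated values): pairing each weight with its negation makes the new conv's output exactly zero, so the skip connection alone passes the signal through unchanged while every weight stays distinct.

import numpy as np

rng = np.random.default_rng(2)
C = 64
h = rng.normal(size=C)
h_wide = np.concatenate([h, h])        # replicated channels: g(j) = j mod 64
w = rng.normal(size=(C, C))            # distinct weights for channels 1..64
U = np.concatenate([w, -w], axis=1)    # channels 65..128 get the negated weights
assert np.allclose(U @ h_wide, 0)      # new layer outputs zero; identity via skip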

(If noise is to be added, I think the new block should be added at the bottom (closest to the final output layer) to minimize accumulated errors.)

@gcp gcp merged commit 6f93c0e into leela-zero:next Jan 24, 2018
@Ttl
Member Author

Ttl commented Jan 24, 2018

@alreadydone Interesting, but I think the current implementation is good enough as is. I don't think I'm going to implement it, but if you want you can code it yourself and submit a pull request.

earthengine pushed a commit to earthengine/leela-zero that referenced this pull request Jan 25, 2018
@alreadydone
Contributor

After the 6x128 network stalls and we use net2net to migrate to 10x128, I wonder whether it would be a good idea to freeze the old blocks during training. This seems to be a common technique in transfer learning and a good way to make sure the new network doesn't forget too much when subjected to training, even with a high learning rate; the features extracted by the 6-block network should be useful for extracting further features anyway. (I would add the 4 new blocks near the output and only allow the weights in these blocks and in the policy/value heads to vary.) This also helps reduce the training time. When the 10x128 stalls, we could unfreeze to squeeze out the last bit of potential.

I wonder whether this strategy would work for transferring knowledge between different komi values and board sizes. It seems we could just copy all the weights in the residual blocks, freeze them (except maybe the last few blocks), and only train the fully connected layers from random initialization; the quality of the self-play games should improve faster and more steadily than if the blocks were not frozen. If anyone tries this, please let me know the result.

Reference: https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/ Section "Ways to Fine tune the model"
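
If anyone wants to experiment, a minimal sketch of the freezing idea in 2018-era TensorFlow (this is not leela-zero's actual tfprocess code; the scope names and the loss tensor are hypothetical): only the new blocks' and heads' variables are handed to the optimizer, so the copied blocks receive no updates.

import tensorflow as tf

# Assume `loss` is the combined policy/value loss and variables are scoped
# per block, e.g. "res_7/...", "policy_head/..." (hypothetical names).
new_scopes = ('res_7', 'res_8', 'res_9', 'res_10', 'policy_head', 'value_head')
trainable = [v for v in tf.trainable_variables()
             if v.name.startswith(new_scopes)]

opt = tf.train.MomentumOptimizer(learning_rate=0.005, momentum=0.9)
train_op = opt.minimize(loss, var_list=trainable)  # frozen blocks get no gradients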

@Ttl
Member Author

Ttl commented Feb 20, 2018

I tried training an 8x128 network using net2net before, and it seemed to work just fine without freezing weights. After 351k steps at batch size 128, I got the following result in validation:

101 wins, 75 losses
176 games played.
Status: 0 LLR 1.73266 Lower Bound -2.94444 Upper Bound 2.94444

Not statistically significant yet, but it seems to be doing well, considering this was the first network I tested, and normally finding a better network seems to require testing many networks.

The previous issues with bootstrapping were probably caused by the shuffling bug.

@bood
Collaborator

bood commented Feb 20, 2018

@Ttl Are you using the 5x64 network c83's self-plays to do the training? What learning rate are you using? The net2net paper seems to suggest using a 1/10 learning rate for the newly generated network?

I agree that we should do some testing on net2net before 6x128 network stalls.

@Ttl
Member Author

Ttl commented Feb 20, 2018

I started with the 63498 network. The learning rate was 0.005 for 150k steps, and after that I dropped it to 0.0005.

@bood
Collaborator

bood commented Feb 20, 2018

I see. But for batch size 128, shouldn't it be using 0.0025 and 0.00025 to match the current setting of 512/0.005? And how did you decide on 150k steps for the high learning rate run?

@Ttl
Member Author

Ttl commented Feb 20, 2018

Yes, if you want to match the current parameters.

@bood
Collaborator

bood commented Feb 22, 2018

Also tried net2net from 5x64 to 6x128, trained with c83 self-plays.

Parameters:
Batch size: 128; learning rate: 0.0025 for the first 152k steps, 0.00025 for the rest

Current results (guess I should reverse the network order):
./validation -g 1 -n ~/5b/c83e1b6e0ffbf8e684f2d8f6261853f14c553b29ee0e70ff6c34e87d28009c43 -n /root/6b/leelaz-model-352000.txt -k 352k | tee 352k.log
312k.log:16 wins, 38 losses
352k.log:2 wins, 14 losses

Looks promising too.

@alreadydone
Contributor

alreadydone commented Feb 22, 2018

I read about CNN backpropagation, and it seems that freezing 6 blocks of weights can indeed cut training time by 60%: there are roughly the same number of weights in a fc layer as in a convolutional layer (361x2x362 in the policy head vs. 128x128x9), but the amount of computation in the convolutional layers gets multiplied by the 361 board positions (19x19), so it dominates. Hope I didn't get these wrong... Training may not be the bottleneck, though.
(Edit: I overlooked that forward propagation still needs to be done in full, so maybe a 30% reduction in training time is a better estimate.)
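
For reference, the rough arithmetic behind this comparison (a sketch using the numbers above; the per-position multiplier is my reading of the estimate):

fc_params   = 361 * 2 * 362        # = 261,364 weights in the policy fc layer
conv_params = 128 * 128 * 9        # = 147,456 weights per 3x3 conv layer
conv_madds  = conv_params * 361    # each conv weight is applied at every board point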

@bood
Collaborator

bood commented Feb 26, 2018

More match results for the 6x128 leelaz-model-352000.txt I trained:

47 wins, 12 losses (data from the last post included)

Also, I tested some ladder scenarios with the tool mentioned in #905. It shows this 6x128 performs better than ed0, which was trained by SL, though still worse than c83 (leelaz v0.12, 1600 playouts):

c83    ed0    352k
51/70  13/70  23/70

@alreadydone
Contributor

alreadydone commented Feb 28, 2018

Is GPU memory what is limiting us from using the same batch size (2048) as in the AGZ paper? If so, this work, which trades memory usage for time, may be useful. The implementation is in TensorFlow as well (it modifies the computation graph). A graph near the bottom shows the effect for up to 20 blocks at batch size 1280.
Eventually we'll switch to larger networks and self-play games will be generated more slowly, so a slight increase in training time seems acceptable (a larger batch size may even be faster?). Shuffling may need to be further optimized to work with a larger batch size, though.

@alreadydone
Contributor

alreadydone commented Mar 28, 2018

@Ttl I guess net2net doesn't currently support adding input planes. Would you mind adding this functionality? I am considering adding a liberty plane to pass the NN the number of liberties of the group containing each stone, and this also seems to be Minigo's plan. It helps bridge the gap between the local nature of convolutional filters and the global/long-range aspects of Go. Combined with tree search and 3x3 filters, this will allow the network to deal with (self-)atari, illegal moves, life-and-death, and ladder problems more efficiently (using less of its capacity, which is especially crucial for a small-sized network). From what I hear, people expect LZ to easily beat top pros and rise to the tier of DeepZenGo and Fine Art once these problems are fixed. I think there's enough evidence that it's much more efficient to count liberties with an optimized algorithm than to force the network to grasp it. Hopefully introducing this plane will boost the training accuracy and then the strength of the network through self-play.

After introducing the liberty plane (a toy sketch of computing such a plane follows the list):

  • illegal moves: a move is illegal when all adjacent same-color groups have 1 liberty and all adjacent opposite-color groups have more than 1 liberty (if some opposing group has only 1, the move is legal).
  • (self-)atari: the net won't have trouble seeing that a large group has only one liberty.
  • ladders: towards the end of the tree search, it won't have a problem seeing that a huge group has one liberty (although it may not know exactly how large it is) and treating the atari/escaping moves as the most urgent.
  • life-and-death: better at semeai (capturing races); won't have a problem capturing a dead group in a no-resign game with an "illegal move".
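
A toy sketch of computing such a per-stone liberty plane via flood fill (illustrative Python, not leela-zero code):

import numpy as np

def liberty_plane(board):
    """board: 2D int array; 0 = empty, 1 = black, 2 = white.
    Returns a plane with each stone's group liberty count (0 on empty points)."""
    n, m = board.shape
    plane = np.zeros((n, m))
    seen = np.zeros((n, m), dtype=bool)
    for i in range(n):
        for j in range(m):
            if board[i, j] == 0 or seen[i, j]:
                continue
            # Flood-fill the group containing (i, j), collecting its liberties.
            color, stack, group, libs = board[i, j], [(i, j)], [], set()
            seen[i, j] = True
            while stack:
                y, x = stack.pop()
                group.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < n and 0 <= nx < m):
                        continue
                    if board[ny, nx] == 0:
                        libs.add((ny, nx))        # shared liberties counted once
                    elif board[ny, nx] == color and not seen[ny, nx]:
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            for y, x in group:
                plane[y, x] = len(libs)
    return plane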

@gcp Having become, by everyone's account, the strongest engine, Leela Zero is nowadays not just pure (reproducing) scientific research but also a widely used tool for analyzing and learning Go**; as you said, we are pushing the limit. What will benefit the most people is a network that is simultaneously strong and efficient (in computing resources): friendlier to the environment, to weaker machines, and to mobile devices. Some people may still want to see the limit a Zero bot can eventually reach with a full-sized 20x256 or 40x256 network (expected to read all meaningful ladders correctly with ? playouts). However, AFAIK there's currently no plan to switch to 20x256 because it's too slow, and even the strong 20x256 network 93229e still shows ladder problems. I do have some new ideas to help bring in more computing power, though it will take some time and momentum to get them running.

If most people stick to the Zero approach and don't agree to use a liberty-enhanced network for self-play, being able to add the plane through net2net at least allows interested people to train such a network themselves. Since the new input plane is not included in the current training data, the liberties had better be calculated during training; hopefully some additions to the training code will also allow people to experiment with other input features.

Footnote: I don't see any point in using separate planes for 1, 2, 3, ... liberties as in the first AlphaGo paper; liberties are additive when you connect two groups (for this to work better, put the number of empty adjacent points in the input plane for an empty point), and AlphaZero also used only one plane for move count.

** A post showing Go schools in Shanghai have adopted AI and mentioning using LZ to review games.

@gcp
Member

gcp commented Mar 28, 2018

You know you don't need net2net support for that, right? In fact, using net2net for that seems extremely bad: quite a bit of the existing net is going to be trained to count liberties, which you're now making redundant. What would be the point of retaining that training?

You should just train a net from randomly initialized weights for that.

I have no interest in doing augmented approaches here. For sure they are interesting and may produce stronger bots. The code is open; you can fork it and try whatever you want.

@bochen2027

@alreadydone At first I thought your cryptocurrency idea was way out there, but now I think it could work; I posted some ideas of my own that somewhat overlap with yours. The idea is to bake crypto/blockchain into the Go GUI itself and integrate Sabaki with LZ.

SabakiHQ/Sabaki#358

@alreadydone
Contributor

alreadydone commented Mar 30, 2018

> quite a bit of the existing net is going to be trained to count liberties, which you're now making redundant.

Interesting point. I think net2net still serves as a smart initialization, though: the network will forget unnecessary, incomplete knowledge about liberties and retain what remains crucial after introducing the plane (e.g. using liberties to make decisions); at least net2net would be the first thing I'd try. It seems some people prefer redundancy and even introduce dropout layers, which we don't have...

I haven't looked into the training code yet, but for inference I think gather_features is the correct place to insert the plane. count_rliberties has been removed from FastBoard, and it looks like m_libs is counting pseudo-liberties? I hope the count_rliberties found in https://github.com/gcp/leela-zero/blob/5fb3b4d1d7b03b69e71e0729016078e4b3d32cd1/src/FastBoard.cpp still works; could it be made more efficient if we want to count the liberties of every stone/group? (Related: #173)

@lightvector

lightvector commented Mar 30, 2018

@alreadydone

> Footnote: I don't see any point in using separate planes for 1, 2, 3, ... liberties as in the first AlphaGo paper; liberties are additive when you connect two groups (for this to work better, put the number of empty adjacent points in the input plane for an empty point), and AlphaZero also used only one plane for move count.

Yeah, I also expect a neural net would be able to make decent use of a single real-valued liberty plane. But it's also not crazy to encode them as separate planes. Encoding it additively in a single plane is sort of like a weak prior that the relevant things the neural net wants to compute vary smoothly with that value, even monotonically or linearly; encoding it in separate planes is sort of like a weak prior that each value behaves distinctly. Either way, the neural net can learn from data to override that prior. I'm not actually sure which one would perform better. You should try it and compare!
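
A toy sketch of the two encodings being compared (illustrative only; the normalization constant and the >=4 cutoff are assumptions, not anyone's tested setup):

import numpy as np

libs = np.array([[1, 2, 0],
                 [3, 0, 5],
                 [0, 4, 2]], dtype=float)   # per-stone liberty counts (0 = empty)

single = libs / 8.0                          # one real-valued plane (assumed scaling)
onehot = np.stack([libs == 1, libs == 2,
                   libs == 3, libs >= 4]).astype(np.float32)  # one plane per count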

By the way, my experience is that for the policy part of the net at least, only atari, 2 liberties, and 3 liberties are important for the neural net to understand; after that you get heavily diminishing returns. For example, when I trained a neural net of about 5 blocks with a typical set of non-"zero" features (liberties, ladder status, etc.), I experimentally tried removing single input planes without retraining the neural net, to see how much unexpectedly losing each input plane alone harmed the prediction accuracy for pro moves.

The baseline test-set accuracy was 52.18%. Losing "opponent stone" and "own stone" massively dropped the accuracy to 5.16% and 4.89%, respectively; obviously these are critical features. Losing "own group has 1 liberty" and "enemy group has 1 liberty" dropped it to 46.93% and 48.19%. Losing "own group has 2 liberties" and "enemy group has 2 liberties" dropped it to 50.16% and 50.10%. For 3 liberties, it was 51.45% and 51.40%. And for 4 liberties, it was 52.01% and 52.23% - so by the time you get to the 4-liberties feature, the net was hardly using it at all and barely noticed whether you included it or not.

@Ttl
Member Author

Ttl commented Mar 30, 2018

I'll see if I can add the functionality for increasing the number of input planes. I'm not sure whether it makes sense to use it or not, but I don't think it's that hard to implement.

zhiyue pushed a commit to awesome-archive/leela-zero that referenced this pull request Apr 14, 2018