Binary file format #1740
Conversation
src/Network.cpp (Outdated)
```cpp
if (!block) { // Very first has a different shape because it's the input layer
    count = filters * 162;
}
std::vector<float> weights = process(count);
```
Suggest putting `auto weights = process(count);` and something similar for the rest of the process() calls.
src/Network.cpp (Outdated)
```cpp
        return {0, 0};
    }
    std::copy(cbegin(weights), cend(weights), begin(m_ip2_val_b));
}
```
This looks like a lot of repetitive code. How about moving the body into the process() calls and handling the `return {0, 0};` case as an exception throw?
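A minimal sketch of what that refactor could look like (the stream-taking process() here is hypothetical, not the PR's actual code):

```cpp
#include <cstddef>
#include <istream>
#include <stdexcept>
#include <vector>

// Read `count` floats from the binary weights stream; throw instead of
// making every caller check for failure and return {0, 0} by hand.
std::vector<float> process(std::istream& wtfile, size_t count) {
    std::vector<float> weights(count);
    wtfile.read(reinterpret_cast<char*>(weights.data()),
                count * sizeof(float));
    if (!wtfile) {
        throw std::runtime_error("truncated or corrupt weights file");
    }
    return weights;
}
```

Each caller then shrinks to a single copy into its target buffer, and one try/catch at the top level can translate the exception back into the {0, 0} failure return.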
Good point. I didn't see a lot of exceptions in the code, so I thought there was some sort of unwritten rule against them. But I suppose I was being overcautious.
I like the implementation - but suggest using
The output with #1736 is 52 MB compressed. If the space saving from binary is just 9 MB, I'm not sure it's worth the trouble.
@Ttl I guess we have to see the size of the compressed binary format. Plus, this is without losing any precision, and this code actually includes half-precision binary formats too. My hope is that half-precision weights will further save space while not losing much strength, but that will need some experimenting.

One question I had was whether it is okay to lose precision on the final fully connected layers (when half precision is used to store weights). The previous half-precision implementations used half precision only within the residual tower, so it isn't totally equivalent. I would like to make sure that half-precision binary weights are okay, or at least keep the final few layers in single precision even when the half-precision format is used.
Okay, perhaps I misunderstood - the 48 MB number is for the fp16 type. So yes, we probably aren't saving a lot of disk space. But we'll probably need to see the strength difference anyway.
@Ttl I could also attempt some form of variable-length encoding to save more space. The reason the space saving seems smaller than it ought to be is that we're already not storing a lot of entropy. Though I'd argue that 9 MB out of 52 MB is still significant, as it's around 16%.
This should be everything outside of the training code, which I will be the first to admit I do not fully understand and do not really have a good way of testing. May I get a review? Is the file format sane and fully featured? Does the code make sense, and is it readable? Is this something the project is interested in?
My opinion is that we should eventually migrate to some form of binary format. I believe the binary format should be at least a little faster than the text format (since we don't need any parsing at all) if done properly. The current text files aren't that useful for debugging anyway (large files with very long lines - too hard to do anything in a text editor), but this may be just me. The reason I say "eventually" is that changing file formats can cause quite a bit of trouble - we have to force everybody to move beyond a certain version (which we sort of do when forcing a minimum autogtp/leelaz version for training), but I am not sure what else we are breaking...
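To illustrate the parsing point above, here is a hedged sketch (not this PR's actual reader) of what the two load paths boil down to:

```cpp
#include <cstddef>
#include <istream>
#include <vector>

// Text: every float goes through character-level decimal parsing.
std::vector<float> read_text(std::istream& in, size_t n) {
    std::vector<float> v(n);
    for (auto& w : v) {
        in >> w;  // locale-aware parse, digit by digit
    }
    return v;
}

// Binary: one bulk copy straight into the vector's storage.
std::vector<float> read_binary(std::istream& in, size_t n) {
    std::vector<float> v(n);
    in.read(reinterpret_cast<char*>(v.data()), n * sizeof(float));
    return v;
}
```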
Did a little bit of performance tuning; these are the times it takes to load various different versions of the same weights:
Note, this is a computer with a very fast hard drive; if you have a slower hard drive, this difference will be more pronounced. About not being sure what else we're breaking: I wish I could be certain that this works 100% properly. Does anyone have some computer time they want to lend? I'd run matches like I used to, but I burned out one of my cards from overuse and haven't been able to replace it.
```diff
@@ -30,6 +30,7 @@
 #include <boost/utility.hpp>
 #include <boost/format.hpp>
 #include <boost/spirit/home/x3.hpp>
+// #include <ctime>
```
Some debug code has been left in
I commented out the timing code, so that other people could replicate my findings. I'll remove it entirely if need be.
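For reference, such a probe can be as simple as the following sketch (using `<chrono>` rather than `<ctime>`; `load_network_file` is an assumed name for the loader's entry point):

```cpp
#include <chrono>
#include <iostream>
#include <string>

void load_network_file(const std::string& filename);  // assumed loader entry point

void time_load(const std::string& filename) {
    const auto t0 = std::chrono::steady_clock::now();
    load_network_file(filename);
    const auto t1 = std::chrono::steady_clock::now();
    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
    std::cout << "weights loaded in " << ms.count() << " ms\n";
}
```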
@OmnipotentEntity - maybe not "breaking", but this will at least make things a bit more complicated. I think we should still keep the 32-bit format for training, and only the distribution versions of the net would be binary-converted to 16-bit. This adds a bit more complexity. Other than that, my only other concern is that we also reduced precision for the final "dual head" part of the net. Our current half-precision net implementation applies half precision to the residual tower, and the "dual head" is still single precision. I see that the final dual-head parameters have quite big values (~70), and on some older nets I recall seeing values in the range of thousands. My suggestion is that we should be a bit more conservative (after all, the "dual head" is only a very small part of the weights) unless we have proof that it is okay to use fp16 even on the final heads.
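For context on those magnitudes (an editorial aside, not from the thread): IEEE-754 half precision has a 10-bit mantissa and a maximum finite value of 65504, so the gap between adjacent representable values grows with magnitude:

$$\mathrm{ulp}_{\mathrm{fp16}}(x) = 2^{\lfloor \log_2 |x| \rfloor - 10}$$

At |x| ≈ 70 that gap is 2⁻⁴ = 0.0625; around |x| ≈ 4096 it is already 4, and anything above 65504 overflows to infinity. Values in the thousands are representable, but only coarsely, which is an argument for keeping the small dual-head layers in single precision.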
A kind person allowed me to use his computer to set up a match. So far the result after 120 games is exactly 60-60, so there doesn't seem to be much of a problem with the loss of precision.
Quick thoughts:

a) We're not sure what the best float format is; see the "half" vs "bfloat16" discussion. Ironically, the text format can deal with both. It would be nice to have some flexibility there if we have a binary one (see the sketch below).

b) Without working training code we can't deploy such nets.

c) If we are revising the network format, we should add beta and gamma weights to the batchnorm layers. We currently encode the beta into the convolution bias of the preceding layer, but we can't encode gamma in a way that lets us reconstruct it afterwards during training. This is a long-standing design mistake in Leela Zero.
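On point (a), the two 16-bit candidates trade off differently: fp16 keeps more mantissa, while bfloat16 keeps the full fp32 exponent range. A minimal sketch of the bfloat16 round trip (illustrative only; a proper fp16 conversion needs rounding and special-case handling and is more involved):

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 is simply the top 16 bits of an IEEE-754 float: same exponent
// range as fp32, with the mantissa cut down to 7 bits.
uint16_t float_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);  // truncate, no rounding
}

float bfloat16_to_float(uint16_t h) {
    const uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```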
I'm closing this to make clear that it's not going to get pulled in its current form. The gain is, all things considered, small (except in loading performance for large networks!), there is a lot to do to make it usable, and the design needs more fleshing out.
I hope I sufficiently addressed this. We can continue in an issue if you want. Improving the network format is something we should do, but it has a large impact as it needs changes all over, so I'd prefer to do it right or not at all.
Whether process_bn_var should be done at writing time is another thing to consider.
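For readers outside the codebase: if memory serves, process_bn_var converts the stored batchnorm variances into the scale factors actually used at inference time, roughly like this (signature and epsilon value are approximate):

```cpp
#include <cmath>
#include <vector>

// Replace each batchnorm variance with 1/sqrt(var + eps), the factor
// the forward pass multiplies by; the epsilon value is assumed here.
void process_bn_var(std::vector<float>& weights,
                    const float epsilon = 1e-5f) {
    for (auto& w : weights) {
        w = 1.0f / std::sqrt(w + epsilon);
    }
}
```

Doing this at write time would save a pass at load time, but it bakes epsilon into the file and makes the raw variances unrecoverable, which matters if the same file is meant to feed back into training.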
This is a first attempt at defining a binary file format.
I fully expect something to be wrong; in fact, I believe there is already something wrong with the value net output (the policy net seems fine, at least), though I cannot locate the mistake. So if someone who is more comfortable with the format of the network could chime in, that would be helpful.
As mentioned in #1733, this is still not enough on its own, because neither the training pipeline nor the server supports this file format yet. But it should be enough to get started on those.
The latest 20b network is only 48042715 bytes (46 MB) uncompressed. It compresses a bit further with gzip, down to 43 MB. This is under half the size of the gzip-compressed file on the server, so this is a big gain in that area at the very least.
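To make "binary file format" concrete, the header of such a file might carry a magic tag, a version, and the network shape up front. This is purely illustrative and not the layout this PR implements; every field is an assumption:

```cpp
#include <cstdint>

// Hypothetical on-disk header for a binary weights file.
struct WeightsHeader {
    uint32_t magic;       // fixed tag so non-weight files are rejected early
    uint32_t version;     // format revision, for forward compatibility
    uint8_t  float_type;  // e.g. 0 = fp32, 1 = fp16, 2 = bfloat16
    uint8_t  reserved[3]; // padding / future use
    uint32_t blocks;      // residual blocks
    uint32_t filters;     // filters per convolution
};
```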