Binary file format #1740
Conversation
src/Network.cpp (Outdated)
```cpp
if (!block) { // Very first has a different shape because it's the input layer
    count = filters * 162;
}
std::vector<float> weights = process(count);
```
Suggest putting `auto weights = process(count);` and something similar for the rest of the process() calls.
src/Network.cpp (Outdated)
```cpp
        return {0, 0};
    }
    std::copy(cbegin(weights), cend(weights), begin(m_ip2_val_b));
}
```
This looks like a lot of repetitive code. How about moving the body into the process() calls and handling the `return {0, 0};` case as an exception throw?
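A minimal sketch of what that refactor could look like (the stream-taking process() here is hypothetical, not the PR's actual code):

```cpp
#include <cstddef>
#include <istream>
#include <stdexcept>
#include <vector>

// Read `count` floats from the binary weights stream; throw instead of
// making every caller check for failure and return {0, 0} by hand.
std::vector<float> process(std::istream& wtfile, size_t count) {
    std::vector<float> weights(count);
    wtfile.read(reinterpret_cast<char*>(weights.data()),
                count * sizeof(float));
    if (!wtfile) {
        throw std::runtime_error("truncated or corrupt weights file");
    }
    return weights;
}
```

Each caller then shrinks to a single copy into its target buffer, and one try/catch at the top level can translate the exception back into the {0, 0} failure return.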
Good point. I didn't see a lot of exceptions in the code, so I thought there was some sort of unwritten rule against them. But I suppose I was being overcautious.
I like the implementation - but suggest using
The output with #1736 is 52 MB compressed. If the space saving from binary is just 9 MB, I'm not sure it's worth the trouble.
@Ttl I guess we have to see the size of the compressed binary format. Plus, this is without losing any precision, and this code actually includes half-precision binary formats too. My hope is that half-precision weights will further save space while not losing much strength, but that will need some experimenting.

One question I had was whether it is okay to lose precision on the final fully connected layers (when half precision is used to store weights). The previous half-precision implementations used half precision only within the residual tower, so it isn't totally equivalent. I would like to make sure that half-precision binary weights are okay, or at least keep the final few layers in single precision even when the half-precision format is used.
Okay, perhaps I misunderstood - the 48 MB number is for the fp16 type. So yes, we probably aren't saving a lot of disk space. But we'll probably need to see the strength difference anyway.
@Ttl I could also attempt some form of variable-length encoding to save more space. The reason the space saving seems smaller than it ought to be is that we're already not storing a lot of entropy. Though I'd argue that 9 MB out of 52 MB is still significant, as it's around 16%.
This should be everything outside of the training code, which I will be the first to admit I do not fully understand and do not really have a good way of testing. May I get a review? Is the file format sane and fully featured? Does the code make sense, and is it readable? Is this something the project is interested in?
My opinion is that we should eventually migrate to some form of binary format. I believe the binary format should be at least a little faster than the text format (since we don't need any parsing at all) if done properly. The current text files aren't that useful for debugging anyway (large files with very long lines - too hard to do anything in a text editor), but this may be just me. The reason I say "eventually" is that changing file formats can cause quite a bit of trouble - we have to force everybody to move beyond a certain version (which we sort of do when forcing a minimum autogtp/leelaz version for training), but I am not sure what else we are breaking...
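To illustrate the parsing point above, here is a hedged sketch (not this PR's actual reader) of what the two load paths boil down to:

```cpp
#include <cstddef>
#include <istream>
#include <vector>

// Text: every float goes through character-level decimal parsing.
std::vector<float> read_text(std::istream& in, size_t n) {
    std::vector<float> v(n);
    for (auto& w : v) {
        in >> w;  // locale-aware parse, digit by digit
    }
    return v;
}

// Binary: one bulk copy straight into the vector's storage.
std::vector<float> read_binary(std::istream& in, size_t n) {
    std::vector<float> v(n);
    in.read(reinterpret_cast<char*>(v.data()), n * sizeof(float));
    return v;
}
```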
Did a little bit of performance tuning; these are the times it takes to load various different versions of the same weights:
Note, this is a computer with a very fast hard drive; if you have a slower hard drive, this difference will be more pronounced. About not being sure what else we're breaking: I wish I could be certain that this works 100% properly. Does anyone have some computer time they want to lend? I'd run matches like I used to, but I burned out one of my cards from overuse and haven't been able to replace it.
```diff
@@ -30,6 +30,7 @@
 #include <boost/utility.hpp>
 #include <boost/format.hpp>
 #include <boost/spirit/home/x3.hpp>
+// #include <ctime>
```
Some debug code has been left in
I commented out the timing code, so that other people could replicate my findings. I'll remove it entirely if need be.
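For reference, such a probe can be as simple as the following sketch (using `<chrono>` rather than `<ctime>`; `load_network_file` is an assumed name for the loader's entry point):

```cpp
#include <chrono>
#include <iostream>
#include <string>

void load_network_file(const std::string& filename);  // assumed loader entry point

void time_load(const std::string& filename) {
    const auto t0 = std::chrono::steady_clock::now();
    load_network_file(filename);
    const auto t1 = std::chrono::steady_clock::now();
    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
    std::cout << "weights loaded in " << ms.count() << " ms\n";
}
```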
@OmnipotentEntity - maybe not "breaking", but this will at least make things a bit more complicated. I think we should still keep the 32-bit format for training, and only the distribution versions of the net would be binary-converted to 16-bit. This adds a bit more complexity. Other than that, my only other concern is that we also reduced precision for the final "dual head" part of the net. Our current half-precision net implementation applies half precision to the residual tower, and the "dual head" is still single precision. I see that the final dual-head parameters have quite big values (~70), and on some older nets I recall seeing values in the range of thousands. My suggestion is that we should be a bit more conservative (after all, the "dual head" is only a very small part of the weights) unless we have proof that it is okay to use fp16 even on the final heads.
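For context on those magnitudes (an editorial aside, not from the thread): IEEE-754 half precision has a 10-bit mantissa and a maximum finite value of 65504, so the gap between adjacent representable values grows with magnitude:

$$\mathrm{ulp}_{\mathrm{fp16}}(x) = 2^{\lfloor \log_2 |x| \rfloor - 10}$$

At |x| ≈ 70 that gap is 2⁻⁴ = 0.0625; around |x| ≈ 4096 it is already 4, and anything above 65504 overflows to infinity. Values in the thousands are representable, but only coarsely, which is an argument for keeping the small dual-head layers in single precision.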
A kind person allowed me to use his computer to set up a match. So far the result after 120 games is exactly 60-60, so there doesn't seem to be much of a problem with the loss of precision.
Quick thoughts:

a) We're not sure what the best float format is; see the "half" vs "bfloat16" discussion. Ironically, the text format can deal with both. It would be nice to have some flexibility there if we have a binary one (see the sketch below).

b) Without working training code we can't deploy such nets.

c) If we are revising the network format, we should add beta and gamma weights to the batchnorm layers. We currently encode the beta into the convolution bias of the preceding layer, but we can't encode gamma in a way that lets us reconstruct it afterwards during training. This is a long-standing design mistake in Leela Zero.
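On point (a), the two 16-bit candidates trade off differently: fp16 keeps more mantissa, while bfloat16 keeps the full fp32 exponent range. A minimal sketch of the bfloat16 round trip (illustrative only; a proper fp16 conversion needs rounding and special-case handling and is more involved):

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 is simply the top 16 bits of an IEEE-754 float: same exponent
// range as fp32, with the mantissa cut down to 7 bits.
uint16_t float_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);  // truncate, no rounding
}

float bfloat16_to_float(uint16_t h) {
    const uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```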
I'm closing this to make clear that it's not going to get pulled in its current form. The gain is, all things considered, small (except in loading performance for large networks!), there is a lot to do to make it usable, and the design needs more fleshing out.
I hope I sufficiently addressed this. We can continue in an issue if you want. Improving the network format is something we should do, but it has a large impact as it needs changes all over, so I'd prefer to do it right or not at all.
Whether process_bn_var should be done at writing time is another thing to consider.
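For readers outside the codebase: if memory serves, process_bn_var converts the stored batchnorm variances into the scale factors actually used at inference time, roughly like this (signature and epsilon value are approximate):

```cpp
#include <cmath>
#include <vector>

// Replace each batchnorm variance with 1/sqrt(var + eps), the factor
// the forward pass multiplies by; the epsilon value is assumed here.
void process_bn_var(std::vector<float>& weights,
                    const float epsilon = 1e-5f) {
    for (auto& w : weights) {
        w = 1.0f / std::sqrt(w + epsilon);
    }
}
```

Doing this at write time would save a pass at load time, but it bakes epsilon into the file and makes the raw variances unrecoverable, which matters if the same file is meant to feed back into training.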
This is a first attempt at defining a binary file format.
I fully expect something to be wrong; in fact, I believe there is already something wrong with the value net output (the policy net seems fine, at least), though I cannot locate the mistake. So if someone who is more comfortable with the format of the network could chime in, that would be helpful.
As mentioned in #1733, this is still not enough on its own, because neither the training pipeline nor the server supports this file format yet. But it should be enough to get started on those.
The latest 20b network is only 48042715 bytes (46 MB) uncompressed. It compresses a bit further with gzip, down to 43 MB. This is under half the size of the gzip-compressed file on the server, so this is a big gain in that area at the very least.
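To make "binary file format" concrete, the header of such a file might carry a magic tag, a version, and the network shape up front. This is purely illustrative and not the layout this PR implements; every field is an assumption:

```cpp
#include <cstdint>

// Hypothetical on-disk header for a binary weights file.
struct WeightsHeader {
    uint32_t magic;       // fixed tag so non-weight files are rejected early
    uint32_t version;     // format revision, for forward compatibility
    uint8_t  float_type;  // e.g. 0 = fp32, 1 = fp16, 2 = bfloat16
    uint8_t  reserved[3]; // padding / future use
    uint32_t blocks;      // residual blocks
    uint32_t filters;     // filters per convolution
};
```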