Roadmaps for Binary File Format implementation #1751

Closed · 2 of 8 tasks

OmnipotentEntity opened this issue Aug 20, 2018 · 26 comments

@OmnipotentEntity
Contributor

OmnipotentEntity commented Aug 20, 2018

As mentioned in #1740, there are a number of things that @gcp wants to hit when rethinking how network weights are done and implementing a binary format.

  • Decide on a robust, extensible binary format
  • Implement conversion scripts
  • Engine Support
  • Training Support
  • Server Support
  • Decide on a 16-bit floating point format that covers the weight range without sacrificing precision
  • Add beta and gamma weights to batchnorm layers
  • process_bn_var might be better done at writing time
@OmnipotentEntity
Contributor Author

Re: the 16-bit floating point format. Why not examine all of the weights and use the minimum number of exponent bits needed to cover the range? The rest can then simply be significand. Finally, store the number of exponent bits in a nibble in the header.

@gcp
Member

gcp commented Aug 21, 2018

Why not examine all of the weights and use the minimum number of exponent bits needed to cover the range? The rest can then simply be significand.

Sounds like a good approach, though you will want to make sure the significand has at least, say, 7 bits of precision or so?

If you can't combine that with the exponent, you may have to round-to-zero for small values and clamp big ones.
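
A rough sketch of the combined idea (not from the thread; numpy is assumed and the helper names are invented): choose the exponent width from the actual weight range, then pack 1 sign bit, the exponent, and whatever is left as significand, rounding to zero or clamping anything that still falls outside the representable range.

import numpy as np

def choose_exp_bits(weights, total_bits=16):
    # Smallest exponent width whose biased range covers all nonzero weights.
    mags = np.abs(np.asarray(weights, dtype=np.float64))
    mags = mags[mags > 0]
    lo = np.floor(np.log2(mags.min()))
    hi = np.floor(np.log2(mags.max()))
    for exp_bits in range(1, total_bits):
        bias = (1 << (exp_bits - 1)) - 1
        if -bias <= lo and hi <= bias:
            return exp_bits
    return total_bits - 1

def encode_value(x, exp_bits, total_bits=16):
    # 1 sign bit, exp_bits of exponent, the rest is significand (no denormals).
    sig_bits = total_bits - 1 - exp_bits
    bias = (1 << (exp_bits - 1)) - 1
    sign = 1 if x < 0 else 0
    if x == 0:
        return sign << (total_bits - 1)
    m, e = np.frexp(abs(x))                 # abs(x) == m * 2**e with 0.5 <= m < 1
    e = int(e) - 1                          # rewrite as 1.f * 2**e
    if e < -bias:                           # too small for the format: round to zero
        return sign << (total_bits - 1)
    e = min(e, bias)                        # too large: clamp the exponent
    frac = int(round((m * 2 - 1) * (1 << sig_bits)))
    frac = min(frac, (1 << sig_bits) - 1)
    return (sign << (total_bits - 1)) | ((e + bias) << sig_bits) | frac

The exponent width found for the whole file would then go into the header nibble; with 16-bit values and, say, 6 exponent bits there are still 9 significand bits left, comfortably above the ~7 bits suggested above.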

@odeint

odeint commented Aug 21, 2018

Decide on a robust, extensible binary format

How about using HDF5 as a container format instead of rolling our own standard? It would be slightly less efficient than pure arrays just serialised and dumped (plus some minimal metadata), but there is a lot of infrastructure around HDF5 already, and it is very extensible.

@gcp
Member

gcp commented Aug 21, 2018

How about using HDF5

No, the HDF5 libs are not a reasonable dependency to add to Leela, for offering almost nothing of value. Leela Chess uses protobuf in some places, but I'm not too fond of that either, for similar reasons.

Note that neither HDF5 nor protobuf would have helped with any of the points mentioned. (Aside from "extensible", where I'm not even sure what it means in this context. You're not going to extend anything if the processing code in the engine doesn't support it in the first place.)

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Aug 21, 2018

I previously created a JSON-based binary format that I used in a game I published. This might be overkill, but it is an option.

By extensible, I meant that the file format can be extended to handle other situations (without having to change the file parsing code significantly, and without harming backwards compatibility), such as the ELF weight stuff the project has had to adapt to in the past.

@jokkebk

jokkebk commented Aug 21, 2018

Files are so large that one could conceivably put in the decoding code first, then just read that in, eval it and use the function to "decompress" the data format. "What could possibly go wrong" 😄

If we used separate "compress/decompress" scripts after downloading to produce a current-format weight file (slightly slower, but it would introduce no changes to the rest of the pipeline), we could even turn it into a competition: who can make the smallest network files with reasonable packing/unpacking speed?

(Just to clarify: the latter is an actual idea; the first paragraph is just an attempt at humor.)

@gcp
Member

gcp commented Aug 22, 2018

I considered that (just wrapping the existing format with encoder/decoder scripts) because it's much more manageable than updating all the code that deals with networks.

But you can't solve the missing batchnorm layer that way, unfortunately.

@marcocalignano
Member

We do not need to reinvent the wheel; have a look at https://github.com/LLNL/zfp

@gcp
Member

gcp commented Aug 23, 2018

Looks interesting, but I doubt our weights data has 2D spatial correlation, so not clear how well it would work.

@odeint

odeint commented Aug 23, 2018

We do not need to reinvent the wheel; have a look at https://github.com/LLNL/zfp

That looks really cool. Reading the paper and following the literature a bit, I also found this development:
https://github.com/disheng222/SZ
http://www.mcs.anl.gov/papers/P5437-1115.pdf

In the paper they explain that they try to exploit smoothness in the data (which we don't really have, at least not obviously), but in case none is found:

Effective Error-bounded Lossy Compression for Unpredictable Data. We optimize (also with O(N) time complexity) the error-bounded lossy compression for compressing the unpredictable data, by elaborately analyzing the binary data representation based on desired precision and data value ranges

That's a little bit unspecific, but it is explained later in the arXiv article (paraphrasing here):

There are three key steps to process.

  • First, we map all of the unpredictable data to a smaller range by letting all the values minus the median value of the range (denoted by med, i.e., med=(mini (ρ[i]) + maxi (ρ[i]))/2). The newly generated data after the transformation are called normalized data, which will be closer to zero than the original ones. Such a step is motivated by the fact that the closer the number is to zero, the less mantissa bits are required to meet the specified precision. [...]
  • Second, we truncate the value by disregarding the insignificant mantissa part based on the user-required error bounds. [...]
  • At the last step, we perform the leading-zero based floating-point compression method to further reduce the storage size. In particular, we perform the XOR operation for the consecutive normalized values and compress each by using a leading zero count followed by the remaining significant bits. This may significantly reduce the storage size due to the fact that the leading bits of the IEEE 754 binary representations for the similar values are expected to be very similar to each other. [...]

So basically, they also truncate, but after normalising first. In the last step they still try to profit somewhat from correlations in the data, despite this being the section about unpredictable data.
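
A much simplified sketch of the first two steps (numpy assumed, helper names invented; the third, XOR/leading-zero step is omitted): shift the data by the range midpoint, then keep only as many mantissa bits as the chosen error bound requires.

import numpy as np

def truncate_mantissa(values, keep_bits):
    # Zero the lowest (23 - keep_bits) bits of the float32 mantissa.
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

def sz_like_truncate(weights, error_bound=1e-3):
    w = np.asarray(weights, dtype=np.float32)
    med = np.float32((w.min() + w.max()) / 2)    # "med", the midpoint of the range
    norm = w - med                               # normalised data, closer to zero
    for keep_bits in range(1, 24):               # fewest mantissa bits meeting the bound
        trunc = truncate_mantissa(norm, keep_bits)
        if np.max(np.abs(trunc - norm)) <= error_bound:
            return trunc, med, keep_bits         # add med back on decompression
    return norm, med, 23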

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Aug 24, 2018

Following up on the JSON idea, since what I said before might not have been entirely clear, here's a text-format version of what the current save file might look like:

{
  "meta" : {
    "fp_size" : 16,
    "exp_bits" : 7,
    "value_head_not_stm" : false
  },

  "weights" : {
    "blocks" : 20,
    "filters" : 256,
    "res" : {
      "conv_weights" : [[],[],[]],
      "conv_biases" : [[],[],[]],
      "batchnorm_means" : [[],[],[]],
      "batchnorm_stddevs" : [[],[],[]]
    },

    "policy_head" : {
      "conv_pol_w" : [],
      "conv_pol_b" : [],
      "bn_pol_w1" : [],
      "bn_pol_w2" : [],
      "ip_pol_w" : [],
      "ip_pol_b" : [],
      "conv_val_w" : [],
      "conv_val_b" : [],
      "bn_val_w1" : [],
      "bn_val_w2" : [],
      "ip1_val_w" : [],
      "ip1_val_b" : [],
      "ip2_val_w" : [],
      "ip2_val_b" : []
    }
  }
}

It's then serialized to binary (with a proper header) to save, and deserialized from binary on restore. Any missing field can have a default value assigned by the engine, so if the format of the nets changes, for instance by preprocessing the batchnorm, it's possible to simply add a field indicating that in the "meta" area of the save file. This way, decisions on that and perhaps other things don't block on the save file format.
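
For illustration, a tiny sketch of the backwards-compatibility side (field names follow the example above, except "batchnorm_preprocessed", which is a hypothetical later addition): any field missing from "meta" simply falls back to a default.

DEFAULT_META = {
    "fp_size": 16,
    "exp_bits": 7,
    "value_head_not_stm": False,
    # hypothetical field added in a later format revision; old files just
    # don't have it and get the default:
    "batchnorm_preprocessed": False,
}

def read_meta(parsed_file):
    meta = dict(DEFAULT_META)
    meta.update(parsed_file.get("meta", {}))
    return meta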

@Remi-Coulom

Hi Leela people,

If you don't mind the self-promotion, let me tell you about joedb. I use it in my generic AlphaZero system to store the training sets as well as the network weights:
https://www.remi-coulom.fr/joedb/intro.html

Joedb can manipulate vectors of numbers efficiently and conveniently:
https://www.remi-coulom.fr/joedb/vectors.html#allocating-and-using-vectors

It has convenient features for automatic schema upgrade, if you ever need to add more fields or tables, or even make any other arbitrary change:
https://www.remi-coulom.fr/joedb/schema_upgrade.html

At the moment, it is C++ only, though. So it is not a good choice if you have to manipulate data in python. But if your program is C++, joedb is a really convenient and efficient way to store binary data.

Storage is efficient (that is to say, storing a large vector of floats takes as little disk space as a raw dump of the vector + a few more bytes for the file format), and C++ code to conveniently manipulate the data is automatically generated.

I considered adding half-precision support, but decided to wait for C++ to support half. For the moment, half-precision numbers can be stored as 16-bit integers.

I'd be glad to answer questions if you wish to know more.

@roy7
Collaborator

roy7 commented Aug 25, 2018

Hi @Remi-Coulom! Pleasure to see you chiming in. :) Thanks for the info on joedb. Hopefully the more technical types will take a look or give it a try. @gcp is traveling this week.

@gcp
Member

gcp commented Sep 3, 2018

The advantages don't seem that relevant to us (i.e. see the first post), and the limitations are showstoppers (we need Python support, flexible binary/float representation, etc.).

To be honest, I'd take something like the JSON over it, if only because that's easily manipulated and well supported by literally everything. But remember that one of the main reasons to take a binary format is parsing and startup time. There are fast parsers for JSON, for sure, but it still means parsing text again... Not sure how good JSONB support is.

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Sep 3, 2018

There are, of course, a few competing "standards" for binary encoded JSON-like data, and there's also the option of "rolling our own." Both are fairly easy to do, because JSON is by design easy to parse and represent in memory.

Here are some of the standards with a feature list.

| Standard | C++ Implementation | Python Implementation | User Defined Types | Packed Binary Type(s) | Notes |
|----------|--------------------|-----------------------|--------------------|-----------------------|-------|
| BSON | Yes (multiple) | Yes (multiple) | Sorta | Yes | Used by MongoDB (impls gutted from MongoDB code, may be low quality) |
| BJSON | No | No | No | Yes | Just a website |
| UBJSON | Yes | Yes | No | Yes | Packed Binary is just uint8 array |
| MsgPack | Yes (multiple) | Yes (multiple) | Yes | Yes | Python impl is CPython |
| Smile | Yes | Yes | No | Yes | Seems quite old |
| CBOR | Yes (multiple) | Yes (multiple) | Yes (with a way to globally register new ones) | Yes | Native half float support, has an RFC, Python impl is CPython |

Of these MsgPack and CBOR seem to be the best suited.
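
As a rough illustration of the CBOR route (assuming the cbor2 and numpy Python packages; the structure and names are only a sketch, not a proposal of the final layout), the weight arrays can be stored as packed byte strings rather than lists of numbers, which is where the size win over the text format comes from:

import cbor2
import numpy as np

def pack_weights(meta, layers):
    # layers: dict mapping layer name -> numpy float array
    doc = {
        "meta": meta,
        "weights": {name: np.asarray(w, dtype=np.float16).tobytes()
                    for name, w in layers.items()},
    }
    return cbor2.dumps(doc)

def unpack_weights(blob):
    doc = cbor2.loads(blob)
    weights = {name: np.frombuffer(raw, dtype=np.float16)
               for name, raw in doc["weights"].items()}
    return doc["meta"], weights

CBOR also has a native half-float type, but encoding every element individually carries a per-value type header, so one byte string per layer (with the precision details kept in "meta") is likely the more compact option.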

@marcocalignano
Member

marcocalignano commented Sep 5, 2018

I was playing around with binary files. For these experiments I took the network 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166 with size

173M (180744218) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166
 50M (51883084) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.gz

The first thing I tried was to just dump the 4-byte floats into a binary file in IEEE 754 notation, without separators. I had to add a counter at the beginning of each layer to tell how many floats there are.
This way we do not lose any precision. We save almost half the file size, although gzip compresses the text file better than the binary one, so the gzipped sizes end up about the same.

92M (96086120) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.ieee754
52M (53482477) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.ieee754.gz

After this I applied something I have already used in the past: by analyzing the weight values in a layer, we can find one factor per layer that maps the actual floating-point values into the range 0 to 65535. Then we can round each scaled value to an integer (what I actually did was take the integer part) and save it in a 16-bit value. At the beginning of each layer, alongside the weight count, we have to save ONE floating-point value: this factor.
This way we do have some loss, but I built a Python script (so it is possible to do this in Python too!) to check that the maximal error is low (less than 0.001).

46M (48044128) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.factor
38M (39641997) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.factor.gz

I guess that to know whether the errors really are low enough, we need to run a validation match between the original and compressed networks.
I just wanted to know what you think of these experiments.
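
For reference, a small sketch of per-layer 16-bit quantisation with a single scale factor (numpy assumed, helper names invented; this is a signed variant, so the exact mapping may differ from the experiment described above, but the error behaviour is the same):

import numpy as np

def quantize_layer(weights):
    w = np.asarray(weights, dtype=np.float32)
    factor = np.float32(np.max(np.abs(w)) / 32767.0)   # the ONE float stored per layer
    q = np.round(w / factor).astype(np.int16)           # 16-bit weights
    return factor, q

def dequantize_layer(factor, q):
    return q.astype(np.float32) * factor

# The worst-case absolute error per weight is factor / 2, so checking
# np.max(np.abs(dequantize_layer(*quantize_layer(w)) - w)) against ~1e-3
# is the same kind of verification as the script described above.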

@alreadydone
Contributor

by analyzing the weight values in a layer, we can find one factor per layer that maps the actual floating-point values into the range 0 to 65535.

I think this is what we do to quantize the weights if we are to use 'INT16' inference.

@OmnipotentEntity
Contributor Author

Are there any objections to the CBOR format with a custom FP type that carries a user-defined precision, and with the logical structure described upthread?

If not, then I'll slowly start working on this.

@evdwerf

evdwerf commented Oct 11, 2018

.... But remember that one of the main reasons to take a binary format is parsing and startup time.

To reduce startup time, perhaps just load the network in a separate thread? If the GTP connection doesn't block until the network is needed, users would probably perceive this as an improvement in startup time.

@roy7
Collaborator

roy7 commented Oct 11, 2018

Speaking of which, if the client didn't block until the network is needed, that would allow time-left settings etc. to be processed before the genmove arrives. An issue I sometimes have with slow startup: if a player uses a time control like Fischer or byo-yomi with no main time and a short time period, it's possible for the bot to time out on the first move of the game, since the game clock starts and time_left is sent to the bot before the network has been loaded (the bot is only started when the first move is needed). If startup takes 10 seconds, the time_left value gets into the bot 10 seconds late; by then the real clock is 10 seconds shorter and the bot might think too long.

If network loading happens in a separate thread and commands like time_left can be processed immediately while the network is loading, that would make the bot's understanding of time left accurate, right? Could be a real improvement for bot operators. ;)

@gcp
Member

gcp commented Oct 23, 2018

If startup takes 10 seconds, the time_left value gets into the bot 10 seconds late; by then the real clock is 10 seconds shorter and the bot might think too long.

The GUI should ping the bot (with sequenced GTP commands) to see if it is finished starting up, IMHO.

@gcp
Member

gcp commented Oct 23, 2018

If network loading happens in a separate thread and commands like time_left can be processed immediately while the network is loading, that would make the bot's understanding of time left accurate, right?

That's true, but I really do think you'll run into other problems unless you block until startup finishes, like the bot forfeiting on time in very fast time controls.

users would probably perceive this as an improvement in startup time

But without a network loaded, you won't be able to interact much :-/ Maybe it's useful if you're going to send a sequence of play commands, I guess.

@evdwerf

evdwerf commented Oct 23, 2018

The typical sequence of GTP commands in GridMaster before starting a game is something like:

protocol_version
list_commands
name
version
boardsize ...
clear_board
komi ...
kgs-rules
time_settings

then after some time for the user to select the first move (or start the clock) we get:

play ...

and then finally where we need the network:

genmove w

So we have on the order of 10 GTP commands that could be answered without having the network (fully) loaded. I would already be happy if just the first four (protocol_version, list_commands, name, version) were answered instantaneously (GridMaster uses those to determine that the engine is alive and understands GTP -- this is also why on most older devices engine installs need the slow-timeouts setting), but it's only a minor inconvenience. The compressed networks are a big win.

@gcp
Member

gcp commented Oct 26, 2018

That makes sense, yes. Doing the load in a separate thread and blocking on it finishing just before we use the network probably isn't too complicated, but it's going to be messy, mostly due to the debug information that is printed out while the network is loaded and benchmarked.

@evdwerf

evdwerf commented Dec 23, 2018

I don't think it needs to get messy. Most of the initial GTP commands don't take noticeable time, so the program could just wait for a command that requires the network, or for the GTP input stream to go idle (and by then, if you really wanted to, you could still do the load in a mode that blocks the main thread). Getting network loading and OpenCL tuning out of the initial startup phase would be a big win IMO.
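
The engine itself is C++, but the shape of the idea is easy to sketch (Python here, with stand-in stubs for loading and command dispatch): start the load in a background thread right away, answer cheap GTP commands immediately, and only join the loader thread when a command that actually needs the network arrives.

import threading
import time

NEEDS_NETWORK = {"genmove", "lz-analyze"}    # illustrative list of commands that need weights
network = None

def load_network(path):
    # Stand-in for the real (slow) weight parsing and OpenCL tuning.
    global network
    time.sleep(5)                            # pretend loading takes a while
    network = object()

def gtp_loop(weights_path):
    loader = threading.Thread(target=load_network, args=(weights_path,))
    loader.start()                           # start loading immediately
    while True:
        try:
            line = input()
        except EOFError:
            break
        parts = line.split()
        if not parts:
            continue
        if parts[0] in NEEDS_NETWORK:
            loader.join()                    # block only here, just before first use
        print("= handled %s\n" % parts[0])   # stand-in for the real GTP dispatch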

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Jul 5, 2019

Long overdue update. I got stalled looking for a good, modern, fully featured JSON library to bring into the project, so I took some time and am contributing to nlohmann/json to add CBOR binary field support to the library. You can follow that here: nlohmann/json#1662

Once that is complete, I'll begin working on this a bit more earnestly.

OmnipotentEntity closed this as not planned on Jan 25, 2023.