Roadmaps for Binary File Format implementation #1751

Closed · 2 of 8 tasks

OmnipotentEntity opened this issue Aug 20, 2018 · 26 comments

@OmnipotentEntity
Contributor

OmnipotentEntity commented Aug 20, 2018

As mentioned in #1740, there are a number of things that @gcp wants to hit when rethinking how network weights are done and implementing a binary format.

  • Decide on a robust, extensible binary format
  • Implement conversion scripts
  • Engine Support
  • Training Support
  • Server Support
  • Decide on a 16-bit floating point format that covers the weight range without sacrificing precision
  • Add beta and gamma weights to batchnorm layers
  • process_bn_var might be better done at writing time
@OmnipotentEntity
Contributor Author

Re: the 16-bit floating point format. Why not examine all of the weights and use the minimum number of exponent bits needed to cover the range? The rest can then simply be significand. Finally, store the number of exponent bits in a nibble in the header.

@gcp
Member

gcp commented Aug 21, 2018

Why not examine all of the weights and use the minimum number of exponent bits needed to cover the range? The rest can then simply be significand.

Sounds like a good approach, though you will want to make sure the significand has at least, say, 7 bits of precision or so?

If you can't combine that with the exponent, you may have to round-to-zero for small values and clamp big ones.
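
A rough sketch of the combined idea (not from the thread; numpy is assumed and the helper names are invented): choose the exponent width from the actual weight range, then pack 1 sign bit, the exponent, and whatever is left as significand, rounding to zero or clamping anything that still falls outside the representable range.

import numpy as np

def choose_exp_bits(weights, total_bits=16):
    # Smallest exponent width whose biased range covers all nonzero weights.
    mags = np.abs(np.asarray(weights, dtype=np.float64))
    mags = mags[mags > 0]
    lo = np.floor(np.log2(mags.min()))
    hi = np.floor(np.log2(mags.max()))
    for exp_bits in range(1, total_bits):
        bias = (1 << (exp_bits - 1)) - 1
        if -bias <= lo and hi <= bias:
            return exp_bits
    return total_bits - 1

def encode_value(x, exp_bits, total_bits=16):
    # 1 sign bit, exp_bits of exponent, the rest is significand (no denormals).
    sig_bits = total_bits - 1 - exp_bits
    bias = (1 << (exp_bits - 1)) - 1
    sign = 1 if x < 0 else 0
    if x == 0:
        return sign << (total_bits - 1)
    m, e = np.frexp(abs(x))                 # abs(x) == m * 2**e with 0.5 <= m < 1
    e = int(e) - 1                          # rewrite as 1.f * 2**e
    if e < -bias:                           # too small for the format: round to zero
        return sign << (total_bits - 1)
    e = min(e, bias)                        # too large: clamp the exponent
    frac = int(round((m * 2 - 1) * (1 << sig_bits)))
    frac = min(frac, (1 << sig_bits) - 1)
    return (sign << (total_bits - 1)) | ((e + bias) << sig_bits) | frac

The exponent width found for the whole file would then go into the header nibble; with 16-bit values and, say, 6 exponent bits there are still 9 significand bits left, comfortably above the ~7 bits suggested above.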

@odeint

odeint commented Aug 21, 2018

Decide on a robust, extensible binary format

How about using HDF5 as a container format instead of rolling our own standard? It would be slightly less efficient than pure arrays just serialised and dumped (plus some minimal metadata), but there is a lot of infrastructure around HDF5 already, and it is very extensible.

@gcp
Member

gcp commented Aug 21, 2018

How about using HDF5

No, the HDF5 libs are not a reasonable dependency to add to Leela, for offering almost nothing of value. Leela Chess uses protobuf in some places, but I'm not too fond of that either, for similar reasons.

Note that neither HDF5 nor protobuf would have helped with any of the points mentioned. (Aside from "extensible", where I'm not even sure what it means in this context. You're not going to extend anything if the processing code in the engine doesn't support it in the first place.)

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Aug 21, 2018

I previously created a JSON-based binary format that I used in a game I published. This might be overkill, but it is an option.

By extensible, I meant that the file format can be extended to handle other situations (without having to change the file parsing code significantly, and without harming backwards compatibility), such as the ELF weight stuff the project has had to adapt to in the past.

@jokkebk

jokkebk commented Aug 21, 2018

Files are so large that one could conceivably put in the decoding code first, then just read that in, eval it and use the function to "decompress" the data format. "What could possibly go wrong" 😄

If we used separate "compress/decompress" scripts after downloading to produce a current-format weight file (slightly slower, but it would introduce no changes to the rest of the pipeline), we could even turn it into a competition: who can make the smallest network files with reasonable packing/unpacking speed?

(Just to clarify: the latter is an actual idea; the first paragraph is just an attempt at humor.)

@gcp
Member

gcp commented Aug 22, 2018

I considered that (just wrapping the existing format with encoder/decoder scripts) because it's much more manageable than updating all the code that deals with networks.

But you can't solve the missing batchnorm layer that way, unfortunately.

@marcocalignano
Member

We do not need to reinvent the wheel; have a look at https://github.com/LLNL/zfp

@gcp
Member

gcp commented Aug 23, 2018

Looks interesting, but I doubt our weights data has 2D spatial correlation, so not clear how well it would work.

@odeint

odeint commented Aug 23, 2018

We do not need to reinvent the wheel; have a look at https://github.com/LLNL/zfp

That looks really cool. Reading the paper and following the literature a bit, I also found this development:
https://github.com/disheng222/SZ
http://www.mcs.anl.gov/papers/P5437-1115.pdf

In the paper they explain that they try to exploit smoothness in the data (which we don't really have, at least not obviously), but in case none is found:

Effective Error-bounded Lossy Compression for Unpredictable Data. We optimize (also with O(N) time complexity) the error-bounded lossy compression for compressing the unpredictable data, by elaborately analyzing the binary data representation based on desired precision and data value ranges

That's a little bit unspecific, but it is explained later in the arXiv article (paraphrasing here):

There are three key steps to process.

  • First, we map all of the unpredictable data to a smaller range by letting all the values minus the median value of the range (denoted by med, i.e., med=(mini (ρ[i]) + maxi (ρ[i]))/2). The newly generated data after the transformation are called normalized data, which will be closer to zero than the original ones. Such a step is motivated by the fact that the closer the number is to zero, the less mantissa bits are required to meet the specified precision. [...]
  • Second, we truncate the value by disregarding the insignificant mantissa part based on the user-required error bounds. [...]
  • At the last step, we perform the leading-zero based floating-point compression method to further reduce the storage size. In particular, we perform the XOR operation for the consecutive normalized values and compress each by using a leading zero count followed by the remaining significant bits. This may significantly reduce the storage size due to the fact that the leading bits of the IEEE 754 binary representations for the similar values are expected to be very similar to each other. [...]

So basically, they also truncate, but after normalising first. In the last step they still try to profit somewhat from correlations in the data, despite this being the section about unpredictable data.
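
A much simplified sketch of the first two steps (numpy assumed, helper names invented; the third, XOR/leading-zero step is omitted): shift the data by the range midpoint, then keep only as many mantissa bits as the chosen error bound requires.

import numpy as np

def truncate_mantissa(values, keep_bits):
    # Zero the lowest (23 - keep_bits) bits of the float32 mantissa.
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

def sz_like_truncate(weights, error_bound=1e-3):
    w = np.asarray(weights, dtype=np.float32)
    med = np.float32((w.min() + w.max()) / 2)    # "med", the midpoint of the range
    norm = w - med                               # normalised data, closer to zero
    for keep_bits in range(1, 24):               # fewest mantissa bits meeting the bound
        trunc = truncate_mantissa(norm, keep_bits)
        if np.max(np.abs(trunc - norm)) <= error_bound:
            return trunc, med, keep_bits         # add med back on decompression
    return norm, med, 23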

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Aug 24, 2018

Following up on the JSON idea, since what I said before might not have been entirely clear, here's a text-format version of what the current save file might look like:

{
  "meta" : {
    "fp_size" : 16,
    "exp_bits" : 7,
    "value_head_not_stm" : false
  },

  "weights" : {
    "blocks" : 20,
    "filters" : 256,
    "res" : {
      "conv_weights" : [[],[],[]],
      "conv_biases" : [[],[],[]],
      "batchnorm_means" : [[],[],[]],
      "batchnorm_stddevs" : [[],[],[]]
    },

    "policy_head" : {
      "conv_pol_w" : [],
      "conv_pol_b" : [],
      "bn_pol_w1" : [],
      "bn_pol_w2" : [],
      "ip_pol_w" : [],
      "ip_pol_b" : [],
      "conv_val_w" : [],
      "conv_val_b" : [],
      "bn_val_w1" : [],
      "bn_val_w2" : [],
      "ip1_val_w" : [],
      "ip1_val_b" : [],
      "ip2_val_w" : [],
      "ip2_val_b" : []
    }
  }
}

It's then serialized to binary (with a proper header) to save, and deserialized from binary on restore. Any missing field can have a default value assigned by the engine, so if the format of the nets changes, for instance by preprocessing the batchnorm, it's possible to simply add a field indicating that in the "meta" area of the save file. This way, decisions on that and perhaps other things don't block on the save file format.
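
For illustration, a tiny sketch of the backwards-compatibility side (field names follow the example above, except "batchnorm_preprocessed", which is a hypothetical later addition): any field missing from "meta" simply falls back to a default.

DEFAULT_META = {
    "fp_size": 16,
    "exp_bits": 7,
    "value_head_not_stm": False,
    # hypothetical field added in a later format revision; old files just
    # don't have it and get the default:
    "batchnorm_preprocessed": False,
}

def read_meta(parsed_file):
    meta = dict(DEFAULT_META)
    meta.update(parsed_file.get("meta", {}))
    return meta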

@Remi-Coulom

Hi Leela people,

If you don't mind the self-promotion, let me tell you about joedb. I use it in my generic AlphaZero system to store the training sets as well as the network weights:
https://www.remi-coulom.fr/joedb/intro.html

Joedb can manipulate vectors of numbers efficiently and conveniently:
https://www.remi-coulom.fr/joedb/vectors.html#allocating-and-using-vectors

It has convenient features for automatic schema upgrade, if you ever need to add more fields or tables, or even make any other arbitrary change:
https://www.remi-coulom.fr/joedb/schema_upgrade.html

At the moment, it is C++ only, though. So it is not a good choice if you have to manipulate data in python. But if your program is C++, joedb is a really convenient and efficient way to store binary data.

Storage is efficient (that is to say, storing a large vector of floats takes as little disk space as a raw dump of the vector + a few more bytes for the file format), and C++ code to conveniently manipulate the data is automatically generated.

I considered adding half-precision support, but decided to wait for C++ to support half. For the moment, half-precision numbers can be stored as 16-bit integers.

I'd be glad to answer questions if you wish to know more.

@roy7
Collaborator

roy7 commented Aug 25, 2018

Hi @Remi-Coulom! Pleasure to see you chiming in. :) Thanks for the info on joedb. Hopefully the more technical types will take a look or give it a try. @gcp is traveling this week.

@gcp
Member

gcp commented Sep 3, 2018

The advantages don't seem that relevant to us (i.e. see the first post), and the limitations are showstoppers (we need Python support, flexible binary/float representation, etc.).

To be honest, I'd take something like the JSON over it, if only because that's easily manipulated and well supported by literally everything. But remember that one of the main reasons to take a binary format is parsing and startup time. There are fast parsers for JSON, for sure, but it still means parsing text again... Not sure how good JSONB support is.

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Sep 3, 2018

There are, of course, a few competing "standards" for binary encoded JSON-like data, and there's also the option of "rolling our own." Both are fairly easy to do, because JSON is by design easy to parse and represent in memory.

Here are some of the standards with a feature list.

| Standard | C++ Implementation | Python Implementation | User Defined Types | Packed Binary Type(s) | Notes |
|----------|--------------------|-----------------------|--------------------|-----------------------|-------|
| BSON | Yes (multiple) | Yes (multiple) | Sorta | Yes | Used by MongoDB (impls gutted from MongoDB code, may be low quality) |
| BJSON | No | No | No | Yes | Just a website |
| UBJSON | Yes | Yes | No | Yes | Packed Binary is just uint8 array |
| MsgPack | Yes (multiple) | Yes (multiple) | Yes | Yes | Python impl is CPython |
| Smile | Yes | Yes | No | Yes | Seems quite old |
| CBOR | Yes (multiple) | Yes (multiple) | Yes (with a way to globally register new ones) | Yes | Native half float support, has an RFC, Python impl is CPython |

Of these MsgPack and CBOR seem to be the best suited.
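
As a rough illustration of the CBOR route (assuming the cbor2 and numpy Python packages; the structure and names are only a sketch, not a proposal of the final layout), the weight arrays can be stored as packed byte strings rather than lists of numbers, which is where the size win over the text format comes from:

import cbor2
import numpy as np

def pack_weights(meta, layers):
    # layers: dict mapping layer name -> numpy float array
    doc = {
        "meta": meta,
        "weights": {name: np.asarray(w, dtype=np.float16).tobytes()
                    for name, w in layers.items()},
    }
    return cbor2.dumps(doc)

def unpack_weights(blob):
    doc = cbor2.loads(blob)
    weights = {name: np.frombuffer(raw, dtype=np.float16)
               for name, raw in doc["weights"].items()}
    return doc["meta"], weights

CBOR also has a native half-float type, but encoding every element individually carries a per-value type header, so one byte string per layer (with the precision details kept in "meta") is likely the more compact option.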

@marcocalignano
Member

marcocalignano commented Sep 5, 2018

I was playing around with binary files. For these experiments I took the network 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166 with size

173M (180744218) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166
 50M (51883084) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.gz

The first thing I tried was to just dump the 4-byte floats into a binary file in IEEE 754 notation, without separators. I had to add a counter at the beginning of each layer to tell how many floats there are.
This way we do not lose any precision. We save almost half the file size, although gzip compresses the text file better than the binary one, so the gzipped sizes end up about the same.

92M (96086120) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.ieee754
52M (53482477) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.ieee754.gz

After this I applied something I have already used in the past: by analyzing the weight values in a layer, we can find one factor per layer that maps the actual floating-point values into the range 0 to 65535. Then we can round each scaled value to an integer (what I actually did was take the integer part) and save it in a 16-bit value. At the beginning of each layer, alongside the weight count, we have to save ONE floating-point value: this factor.
This way we do have some loss, but I built a Python script (so it is possible to do this in Python too!) to check that the maximal error is low (less than 0.001).

46M (48044128) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.factor
38M (39641997) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.factor.gz

I guess that to know whether the errors really are low enough, we need to run a validation match between the original and compressed networks.
I just wanted to know what you think of these experiments.
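
For reference, a small sketch of per-layer 16-bit quantisation with a single scale factor (numpy assumed, helper names invented; this is a signed variant, so the exact mapping may differ from the experiment described above, but the error behaviour is the same):

import numpy as np

def quantize_layer(weights):
    w = np.asarray(weights, dtype=np.float32)
    factor = np.float32(np.max(np.abs(w)) / 32767.0)   # the ONE float stored per layer
    q = np.round(w / factor).astype(np.int16)           # 16-bit weights
    return factor, q

def dequantize_layer(factor, q):
    return q.astype(np.float32) * factor

# The worst-case absolute error per weight is factor / 2, so checking
# np.max(np.abs(dequantize_layer(*quantize_layer(w)) - w)) against ~1e-3
# is the same kind of verification as the script described above.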

@alreadydone
Contributor

by analyzing the weight values in a layer, we can find one factor per layer that maps the actual floating-point values into the range 0 to 65535.

I think this is what we do to quantize the weights if we are to use 'INT16' inference.

@OmnipotentEntity
Contributor Author

Are there any objections to the CBOR format with a custom FP type that carries a user-defined precision, and with the logical structure described upthread?

If not, then I'll slowly start working on this.

@evdwerf

evdwerf commented Oct 11, 2018

.... But remember that one of the main reasons to take a binary format is parsing and startup time.

To reduce startup time, perhaps just load the network in a separate thread? If the GTP connection doesn't block until the network is needed, users would probably perceive this as an improvement in startup time.

@roy7
Collaborator

roy7 commented Oct 11, 2018

Speaking of which, if the client didn't block until the network is needed, that would allow time-left settings etc. to be processed before the genmove arrives. An issue I sometimes have with slow startup: if a player uses a time control like Fischer or byo-yomi with no main time and a short time period, it's possible for the bot to time out on the first move of the game, since the game clock starts and time_left is sent to the bot before the network has been loaded (the bot is only started when the first move is needed). If startup takes 10 seconds, the time_left value gets into the bot 10 seconds late; by then the real clock is 10 seconds shorter and the bot might think too long.

If network loading happens in a separate thread and commands like time_left can be processed immediately while the network is loading, that would make the bot's understanding of time left accurate, right? Could be a real improvement for bot operators. ;)

@gcp
Member

gcp commented Oct 23, 2018

If startup takes 10 seconds, the time_left value gets into the bot 10 seconds late; by then the real clock is 10 seconds shorter and the bot might think too long.

The GUI should ping the bot (with sequenced GTP commands) to see if it is finished starting up, IMHO.

@gcp
Member

gcp commented Oct 23, 2018

If network loading happens in a separate thread and commands like time_left can be processed immediately while the network is loading, that would make the bot's understanding of time left accurate, right?

That's true, but I really do think you'll run into other problems unless you block until startup finishes, like the bot forfeiting on time in very fast time controls.

users would probably perceive this as an improvement in startup time

But without a network loaded, you won't be able to interact much :-/ Maybe it's useful if you're going to send a sequence of play commands, I guess.

@evdwerf

evdwerf commented Oct 23, 2018

The typical sequence of GTP commands in GridMaster before starting a game is something like:

protocol_version
list_commands
name
version
boardsize ...
clear_board
komi ...
kgs-rules
time_settings

then after some time for the user to select the first move (or start the clock) we get:

play ...

and then finally where we need the network:

genmove w

So we have on the order of 10 GTP commands that could be answered without having the network (fully) loaded. I would already be happy if just the first four (protocol_version, list_commands, name, version) were answered instantaneously (GridMaster uses those to determine that the engine is alive and understands GTP -- this is also why on most older devices engine installs need the slow-timeouts setting), but it's only a minor inconvenience. The compressed networks are a big win.

@gcp
Member

gcp commented Oct 26, 2018

That makes sense, yes. Doing the load in a separate thread and blocking on it finishing just before we use the network probably isn't too complicated, but it's going to be messy, mostly due to the debug information that is printed out while the network is loaded and benchmarked.

@evdwerf

evdwerf commented Dec 23, 2018

I don't think it needs to get messy. Most of the initial GTP commands don't take noticeable time, so the program could just wait for a command that requires the network, or for the GTP input stream to go idle (and by then, if you really wanted to, you could still do the load in a mode that blocks the main thread). Getting network loading and OpenCL tuning out of the initial startup phase would be a big win IMO.
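
The engine itself is C++, but the shape of the idea is easy to sketch (Python here, with stand-in stubs for loading and command dispatch): start the load in a background thread right away, answer cheap GTP commands immediately, and only join the loader thread when a command that actually needs the network arrives.

import threading
import time

NEEDS_NETWORK = {"genmove", "lz-analyze"}    # illustrative list of commands that need weights
network = None

def load_network(path):
    # Stand-in for the real (slow) weight parsing and OpenCL tuning.
    global network
    time.sleep(5)                            # pretend loading takes a while
    network = object()

def gtp_loop(weights_path):
    loader = threading.Thread(target=load_network, args=(weights_path,))
    loader.start()                           # start loading immediately
    while True:
        try:
            line = input()
        except EOFError:
            break
        parts = line.split()
        if not parts:
            continue
        if parts[0] in NEEDS_NETWORK:
            loader.join()                    # block only here, just before first use
        print("= handled %s\n" % parts[0])   # stand-in for the real GTP dispatch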

@OmnipotentEntity
Contributor Author

OmnipotentEntity commented Jul 5, 2019

Long overdue update. I got stalled looking for a good, modern, fully featured JSON library to bring into the project, so I took some time and am contributing to nlohmann/json to add CBOR binary field support to the library. You can follow that here: nlohmann/json#1662

Once that is complete, I'll begin working on this a bit more earnestly.

OmnipotentEntity closed this as not planned on Jan 25, 2023.