Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A very compact/flexible/fast VarInt #8

Open
MikeFair opened this issue Feb 3, 2017 · 0 comments
Open

A very compact/flexible/fast VarInt #8

MikeFair opened this issue Feb 3, 2017 · 0 comments

Comments

@MikeFair
Copy link

MikeFair commented Feb 3, 2017

I noticed the README acknowledged there still might be some time to discuss the varint spec.
So I'd like to suggest something we came up with over at the UBJSON forums.

In the format below a VarInt is either one byte or you will know right away how big the Integer will be.
Numbers below 240 are encoded directly as a single byte.
Values above 240 are used to indicate MSB or LSB, and 2, 4, or 8 bytes (max is 9 bytes total).
There are two "special case" values; 255 is for the number 255, and 254 is used for "NaN" or "null".


Having a NaN value is useful for terminating an unknown list of counted segments.
If the format was [varint]...data....[varint]...data... because you were streaming the data and the source didn't know exactly how many there were going to be before it began (like an array of strings), and "0" is considered a valid length; then NaN tells you when you've reached the end. You can also use NaN to indicate "unknown" wherever a numeric value is expected, but can't be provided or isn't known yet.


I really, really, find it impossible to comprehend counting anything larger than a 64-bit int.
Keep in mind that's what this VarInt is typically used for, counting things to process or skip over.
2^64 is 18,446,744,073,709,551,616. In bytes, that's 17,179,869,184 gigabytes.

If the number really is bigger than 64-bits, it's more like a binary data blob than a number.

If you make it arbitrary precision then a correct implementation must use an arbitrary precision library to read in and store the value. If you say "the max is a 64-bit value" then 64-bit+ machines are golden, and smaller architectures 32-bit machines and below know what they have to do and where they can stop.

Keep in mind this value is typically intended for quick processing by a CPU; either to get added to some base memory address or decremented until it hits 0. It can be used other places, but that's the most common function. Ensuring the number fits in typical CPU registers will make things go faster.


By optionally encoding MSB/LSB as part of the value, you save unnecessary overhead in most cases.
The rule is that unless you know you should change the endianness of the value, encode it based on the native format of the platform architecture you're on.

In homogeneous architectures, wire communications will just naturally happen in the native endianness.
Only when a higher power server detects it is peering with a resource constrained opposite architecture does it make sense to proactively change the endianness of the stored value. And that's not required, it's just polite. Mixed endian transmissions are no worse off than they were before, sometimes you just gotta flip the bits because the format didn't agree with the native encoding.

And of course this only matters for numbers >= 240.


Here's the proposal in a binary break down:

If the first 4 bits are all 1s, then it's a special value; otherwise take its direct value.
If the high bit on the remaining 4 bits is a one, it's an LSB value; otherwise MSB.
The next bit is unused/ignored/skipped except when part of either NaN or 255.
The final two bits indicate the length in bytes of the required integer (2^X).

x00 - xEF = take the raw value (0-239)
xF0
xF1  == MSB 2 bytes (241)
xF2  == MSB 4 bytes (242)
xF3  == MSB 8 bytes (243)
xF4
xF5
xF6
xF7
xF8
xF9  == LSB 2 bytes (249)
xFA  == LSB 4 bytes (250)
xFB  == LSB 8 bytes (251)
xFC
xFD
xFE  == NaN/Null (254)
xFF  == 255 (255)


The gaps are there for future growth but left unused to create bounded certainty for current uses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant