varint encoding thoughts #1

Closed · raphlinus opened this issue Jun 24, 2021 · 2 comments

@raphlinus

I’m watching Behdad’s presentation on “better engineered font formats”, and finding that a lot of the discussion (in the early parts, anyway) is focused on the representation of integers in the file format.

File formats of TrueType’s vintage are made largely of integers with hand-chosen fixed-size representations. A fixed-size representation facilitates random access, but the size chosen induces a Goldilocks problem: too small imposes limits, while too large wastes file size. Simply changing the 16-bit glyph IDs, array lengths, and offsets of OpenType to 32 bits would double the size of those parts of the file.

Varint encodings address this tradeoff and are widely used; for example, LEB128, aka VByte, is used in protobuf. They are usually thought to be slower than fixed-size encodings, and incompatible with random access. Both preconceptions indicate a failure of imagination.
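
For concreteness, here is a minimal LEB128 round trip in Rust (a sketch of my own, not code from protobuf or any other library): each byte carries 7 payload bits, little end first, and the high bit signals that another byte follows.

```rust
// Minimal LEB128 round trip: 7 payload bits per byte, little end first;
// the high bit marks continuation. This sketch does not validate
// overlong encodings or overflow bits in the final byte.

fn leb128_encode(mut v: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Returns the decoded value and the number of bytes consumed.
fn leb128_decode(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut v = 0u32;
    for (i, &b) in bytes.iter().take(5).enumerate() {
        v |= ((b & 0x7f) as u32) << (7 * i);
        if b & 0x80 == 0 {
            return Some((v, i + 1));
        }
    }
    None // truncated input (a u32 takes at most 5 bytes)
}

fn main() {
    let mut buf = Vec::new();
    for &v in &[0u32, 127, 128, 300, u32::MAX] {
        leb128_encode(v, &mut buf);
    }
    let mut pos = 0;
    while pos < buf.len() {
        let (v, n) = leb128_decode(&buf[pos..]).unwrap();
        println!("{v} ({n} byte(s))");
        pos += n;
    }
}
```

Small values pay one byte, and the size grows only as the values do; 127 fits in a single byte, 128 takes two.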

On CPU, recent work by Daniel Lemire’s group shows that decoding sequences of varint-encoded integers can happen at very high speeds (~4 billion integers per second). LEB128 decoding is a reasonably simple monoid, which means it can be done very efficiently on GPU as well.
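
As a CPU-side illustration of why the encoding parallelizes (my own sketch, not Lemire's algorithm and not piet-gpu code): every byte can decide on its own whether it terminates a varint, so integer boundaries come out of a map plus a prefix sum, and each integer can then be decoded by an independent thread.

```rust
// Sketch: boundary-finding for a LEB128 stream as map + scan + map.
// Shown sequentially; each pass is one-thread-per-byte on a GPU.

fn decode_all(bytes: &[u8]) -> Vec<u32> {
    if bytes.is_empty() {
        return Vec::new();
    }
    // Map: a byte with its high bit clear terminates a varint.
    let is_end: Vec<bool> = bytes.iter().map(|&b| b & 0x80 == 0).collect();

    // Scan: in a GPU formulation this is an exclusive prefix sum over
    // `is_end`; here we just collect the start position of each varint.
    let mut starts = vec![0usize];
    for (i, &end) in is_end.iter().enumerate() {
        if end && i + 1 < bytes.len() {
            starts.push(i + 1);
        }
    }

    // Map: decode each varint independently.
    starts
        .iter()
        .map(|&s| {
            let mut v = 0u32;
            for (i, &b) in bytes[s..].iter().take(5).enumerate() {
                v |= ((b & 0x7f) as u32) << (7 * i);
                if b & 0x80 == 0 {
                    break;
                }
            }
            v
        })
        .collect()
}

fn main() {
    // 1, 300, 2 encoded back to back (300 is 0xAC 0x02).
    assert_eq!(decode_all(&[0x01, 0xAC, 0x02, 0x02]), vec![1, 300, 2]);
}
```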

What about random access? That’s possible too: acceleration structures can be built, either computed quickly in a pass over the file or, optionally, packaged with the file.
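
Here is one way such an acceleration structure could look (the stride and helper names are my own choices, not from any spec; a well-formed stream is assumed): record the byte offset of every STRIDE-th varint in a single pass, after which a lookup skips at most STRIDE - 1 varints from the nearest recorded offset.

```rust
// Sketch of a random-access index over a LEB128 stream.

const STRIDE: usize = 64; // space/speed tradeoff, chosen arbitrarily

fn skip_varint(bytes: &[u8], mut pos: usize) -> usize {
    while bytes[pos] & 0x80 != 0 {
        pos += 1;
    }
    pos + 1
}

/// One pass over the stream: record the byte offset of every
/// STRIDE-th varint.
fn build_index(bytes: &[u8]) -> Vec<usize> {
    let mut index = Vec::new();
    let mut pos = 0;
    let mut count = 0usize;
    while pos < bytes.len() {
        if count % STRIDE == 0 {
            index.push(pos);
        }
        pos = skip_varint(bytes, pos);
        count += 1;
    }
    index
}

/// Random access to the i-th integer: jump to the nearest indexed
/// varint, skip forward, then decode.
fn get(bytes: &[u8], index: &[usize], i: usize) -> u32 {
    let mut pos = index[i / STRIDE];
    for _ in 0..(i % STRIDE) {
        pos = skip_varint(bytes, pos);
    }
    let mut v = 0u32;
    for (j, &b) in bytes[pos..].iter().take(5).enumerate() {
        v |= ((b & 0x7f) as u32) << (7 * j);
        if b & 0x80 == 0 {
            break;
        }
    }
    v
}
```

The index costs one offset per STRIDE integers, so it is cheap enough to rebuild in a pass at load time or, as suggested above, to ship alongside the stream.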

If I were building a font file format from scratch, I would seriously consider a layered encoding strategy. The top layer would be defined in terms of a sequence of 32-bit integers. Offsets would count 32-bit values, which I expect would economize entropy. The bottom layer would just be the LEB128 encoding of that sequence, perhaps with packaged acceleration structures as well.
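
To make the entropy argument concrete (a toy calculation of mine, not part of any proposed spec): an offset that counts 32-bit elements is a quarter the magnitude of the equivalent byte offset, which often saves a LEB128 byte.

```rust
// Toy comparison: LEB128 length of an offset counted in u32 elements
// versus the same position counted in bytes. `encoded_len` is my helper.

fn encoded_len(mut v: u32) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

fn main() {
    let element_offset = 5_000u32;        // points at the 5000th u32
    let byte_offset = element_offset * 4; // same position, in bytes
    assert_eq!(encoded_len(element_offset), 2);
    assert_eq!(encoded_len(byte_offset), 3); // one byte more per offset
}
```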

I believe this would be efficient in space and time (and likely compress well with Brotli), but perhaps more importantly would be a conceptual simplification as it removes the need to hand-allocate bit sizes.

I will very likely use this technique in piet-gpu (right now it uses fixed-size encoding, which is wasteful of space).

@ldo

ldo commented Jun 26, 2021

I notice that encodings like LEB128 allow redundant encodings. That is, you can add any number of leading zeroes and still encode the same integer.

Let me suggest an alternative that, besides being very slightly more compact, has no redundant encodings.
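
The alternative itself isn't quoted above, but the redundancy is easy to demonstrate. In this sketch (the decoder is the same shape as earlier; names are mine), two different byte strings decode to the same value:

```rust
// LEB128 redundancy: continuation bytes with zero payload pad a value
// without changing it, so one integer has many encodings.

fn leb128_decode(bytes: &[u8]) -> u32 {
    let mut v = 0u32;
    for (i, &b) in bytes.iter().take(5).enumerate() {
        v |= ((b & 0x7f) as u32) << (7 * i);
        if b & 0x80 == 0 {
            break;
        }
    }
    v
}

fn main() {
    let canonical = [0x01u8];        // the value 1, one byte
    let padded = [0x81, 0x80, 0x00]; // also 1: zero-payload padding bytes
    assert_eq!(leb128_decode(&canonical), 1);
    assert_eq!(leb128_decode(&padded), 1);
}
```

One known family of fixes biases each length's range so that every byte string maps to a distinct integer; whether that matches the alternative proposed here isn't shown above.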

@behdad
Member

behdad commented Sep 17, 2021

Thanks, Raph. In future formats I do believe using a varint encoding is the way to go, and my suggestion of using FlatBuffers or WebAssembly would do that automatically. For the Boring Expansion project, though, I think it's too disruptive to switch the font format style so drastically. But I'll keep it in mind as I design various parts.

behdad closed this as completed Sep 17, 2021