varint encoding thoughts #1

Closed · raphlinus opened this issue Jun 24, 2021 · 2 comments

@raphlinus

I’m watching Behdad’s presentation on “better engineered font formats”, and finding that a lot of the discussion (in the early parts, anyway) is focused on the representation of integers in the file format.

File formats of TrueType’s vintage are made largely of integers with hand-chosen fixed-size representations. A fixed-size representation facilitates random access, but the size chosen induces a Goldilocks problem: too small imposes limits, while too large wastes file size. Simply changing the 16-bit glyph IDs, array lengths, and offsets of OpenType to 32 bits would double the size of those parts of the file.

Varint encodings address this tradeoff and are widely used; for example, LEB128, aka VByte, is used in protobuf. They are usually thought to be slower than fixed-size encodings, and incompatible with random access. Both preconceptions indicate a failure of imagination.
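
For concreteness, here is a minimal LEB128 round trip in Rust (a sketch of my own, not code from protobuf or any other library): each byte carries 7 payload bits, little end first, and the high bit signals that another byte follows.

```rust
// Minimal LEB128 round trip: 7 payload bits per byte, little end first;
// the high bit marks continuation. This sketch does not validate
// overlong encodings or overflow bits in the final byte.

fn leb128_encode(mut v: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Returns the decoded value and the number of bytes consumed.
fn leb128_decode(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut v = 0u32;
    for (i, &b) in bytes.iter().take(5).enumerate() {
        v |= ((b & 0x7f) as u32) << (7 * i);
        if b & 0x80 == 0 {
            return Some((v, i + 1));
        }
    }
    None // truncated input (a u32 takes at most 5 bytes)
}

fn main() {
    let mut buf = Vec::new();
    for &v in &[0u32, 127, 128, 300, u32::MAX] {
        leb128_encode(v, &mut buf);
    }
    let mut pos = 0;
    while pos < buf.len() {
        let (v, n) = leb128_decode(&buf[pos..]).unwrap();
        println!("{v} ({n} byte(s))");
        pos += n;
    }
}
```

Small values pay one byte, and the size grows only as the values do; 127 fits in a single byte, 128 takes two.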

On CPU, recent work by Daniel Lemire’s group shows that decoding sequences of varint-encoded integers can happen at very high speeds (~4 billion integers per second). LEB128 decoding is a reasonably simple monoid, which means it can be done very efficiently on GPU as well.
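
As a CPU-side illustration of why the encoding parallelizes (my own sketch, not Lemire's algorithm and not piet-gpu code): every byte can decide on its own whether it terminates a varint, so integer boundaries come out of a map plus a prefix sum, and each integer can then be decoded by an independent thread.

```rust
// Sketch: boundary-finding for a LEB128 stream as map + scan + map.
// Shown sequentially; each pass is one-thread-per-byte on a GPU.

fn decode_all(bytes: &[u8]) -> Vec<u32> {
    if bytes.is_empty() {
        return Vec::new();
    }
    // Map: a byte with its high bit clear terminates a varint.
    let is_end: Vec<bool> = bytes.iter().map(|&b| b & 0x80 == 0).collect();

    // Scan: in a GPU formulation this is an exclusive prefix sum over
    // `is_end`; here we just collect the start position of each varint.
    let mut starts = vec![0usize];
    for (i, &end) in is_end.iter().enumerate() {
        if end && i + 1 < bytes.len() {
            starts.push(i + 1);
        }
    }

    // Map: decode each varint independently.
    starts
        .iter()
        .map(|&s| {
            let mut v = 0u32;
            for (i, &b) in bytes[s..].iter().take(5).enumerate() {
                v |= ((b & 0x7f) as u32) << (7 * i);
                if b & 0x80 == 0 {
                    break;
                }
            }
            v
        })
        .collect()
}

fn main() {
    // 1, 300, 2 encoded back to back (300 is 0xAC 0x02).
    assert_eq!(decode_all(&[0x01, 0xAC, 0x02, 0x02]), vec![1, 300, 2]);
}
```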

What about random access? That’s possible too: acceleration structures can be built, either computed quickly in a pass over the file or, optionally, packaged with the file.
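
Here is one way such an acceleration structure could look (the stride and helper names are my own choices, not from any spec; a well-formed stream is assumed): record the byte offset of every STRIDE-th varint in a single pass, after which a lookup skips at most STRIDE - 1 varints from the nearest recorded offset.

```rust
// Sketch of a random-access index over a LEB128 stream.

const STRIDE: usize = 64; // space/speed tradeoff, chosen arbitrarily

fn skip_varint(bytes: &[u8], mut pos: usize) -> usize {
    while bytes[pos] & 0x80 != 0 {
        pos += 1;
    }
    pos + 1
}

/// One pass over the stream: record the byte offset of every
/// STRIDE-th varint.
fn build_index(bytes: &[u8]) -> Vec<usize> {
    let mut index = Vec::new();
    let mut pos = 0;
    let mut count = 0usize;
    while pos < bytes.len() {
        if count % STRIDE == 0 {
            index.push(pos);
        }
        pos = skip_varint(bytes, pos);
        count += 1;
    }
    index
}

/// Random access to the i-th integer: jump to the nearest indexed
/// varint, skip forward, then decode.
fn get(bytes: &[u8], index: &[usize], i: usize) -> u32 {
    let mut pos = index[i / STRIDE];
    for _ in 0..(i % STRIDE) {
        pos = skip_varint(bytes, pos);
    }
    let mut v = 0u32;
    for (j, &b) in bytes[pos..].iter().take(5).enumerate() {
        v |= ((b & 0x7f) as u32) << (7 * j);
        if b & 0x80 == 0 {
            break;
        }
    }
    v
}
```

The index costs one offset per STRIDE integers, so it is cheap enough to rebuild in a pass at load time or, as suggested above, to ship alongside the stream.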

If I were building a font file format from scratch, I would seriously consider a layered encoding strategy. The top layer would be defined in terms of a sequence of 32-bit integers. Offsets would count 32-bit values, which I expect would economize entropy. The bottom layer would just be the LEB128 encoding of that sequence, perhaps with packaged acceleration structures as well.
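
To make the entropy argument concrete (a toy calculation of mine, not part of any proposed spec): an offset that counts 32-bit elements is a quarter the magnitude of the equivalent byte offset, which often saves a LEB128 byte.

```rust
// Toy comparison: LEB128 length of an offset counted in u32 elements
// versus the same position counted in bytes. `encoded_len` is my helper.

fn encoded_len(mut v: u32) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

fn main() {
    let element_offset = 5_000u32;        // points at the 5000th u32
    let byte_offset = element_offset * 4; // same position, in bytes
    assert_eq!(encoded_len(element_offset), 2);
    assert_eq!(encoded_len(byte_offset), 3); // one byte more per offset
}
```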

I believe this would be efficient in space and time (and likely compress well with Brotli), but perhaps more importantly would be a conceptual simplification as it removes the need to hand-allocate bit sizes.

I will very likely use this technique in piet-gpu (right now it uses fixed-size encoding, which is wasteful of space).

@ldo

ldo commented Jun 26, 2021

I notice that encodings like LEB128 allow redundant encodings. That is, you can add any number of leading zeroes and still encode the same integer.

Let me suggest an alternative that, besides being very slightly more compact, has no redundant encodings.
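
The alternative itself isn't quoted above, but the redundancy is easy to demonstrate. In this sketch (the decoder is the same shape as earlier; names are mine), two different byte strings decode to the same value:

```rust
// LEB128 redundancy: continuation bytes with zero payload pad a value
// without changing it, so one integer has many encodings.

fn leb128_decode(bytes: &[u8]) -> u32 {
    let mut v = 0u32;
    for (i, &b) in bytes.iter().take(5).enumerate() {
        v |= ((b & 0x7f) as u32) << (7 * i);
        if b & 0x80 == 0 {
            break;
        }
    }
    v
}

fn main() {
    let canonical = [0x01u8];        // the value 1, one byte
    let padded = [0x81, 0x80, 0x00]; // also 1: zero-payload padding bytes
    assert_eq!(leb128_decode(&canonical), 1);
    assert_eq!(leb128_decode(&padded), 1);
}
```

One known family of fixes biases each length's range so that every byte string maps to a distinct integer; whether that matches the alternative proposed here isn't shown above.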

@behdad
Member

behdad commented Sep 17, 2021

Thanks, Raph. In future formats I do believe using a varint encoding is the way to go, and my suggestion of using FlatBuffers or WebAssembly would do that automatically. For the Boring Expansion project, though, I think it's too disruptive to switch the font format style so drastically. But I'll keep it in mind as I design various parts.

behdad closed this as completed Sep 17, 2021