I’m watching Behdad’s presentation on “better engineered font formats”, and finding that a lot of the discussion (in the early parts anyway) is focused on the representation of integers in the file format.
File formats of TrueType’s vintage are made largely of integers with hand-chosen, fixed-size representations. A fixed-size representation facilitates random access, but the size chosen induces a Goldilocks problem: too small imposes limits, while too large wastes file size. Simply changing the 16 bit glyph IDs, array lengths, and offsets of OpenType to 32 bits would double the size of those parts of the file.
Varint encodings address this tradeoff and are widely used. For example, LEB128, aka VByte, is used in protobuf. They are usually thought to be slower than fixed-size encodings, and also incompatible with random access. Both preconceptions indicate a failure of imagination.
On CPU, recent work by Daniel Lemire’s group shows that decompression of sequences of varint encoded integers can happen at very high speeds (~4 billion integers per second). LEB128 decoding is a reasonably simple and straightforward monoid, which means that it can be decoded very efficiently on GPU as well.
What about random access? That’s possible too. Acceleration structures can be built, either computed quickly in a pass over the file, or, optionally, packaged with the file.
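One simple form such an acceleration structure could take (a sketch of my own, not a proposal from the issue) is a sidecar table recording the byte offset of every K-th integer, built in a single pass over the stream. A random access then costs one table lookup plus a forward scan of at most K−1 varints.

```rust
// Index stride: small here for illustration; a real format would tune it.
const K: usize = 4;

/// Encode a slice of u32s as a concatenated LEB128 stream.
fn encode(values: &[u32]) -> Vec<u8> {
    let mut buf = Vec::new();
    for &v in values {
        let mut v = v;
        while v >= 0x80 {
            buf.push((v & 0x7f) as u8 | 0x80);
            v >>= 7;
        }
        buf.push(v as u8);
    }
    buf
}

/// Skip past one varint, returning the position just after it.
fn skip(buf: &[u8], mut pos: usize) -> usize {
    while buf[pos] & 0x80 != 0 {
        pos += 1;
    }
    pos + 1
}

/// One pass over the stream: record where every K-th integer starts.
fn build_index(buf: &[u8]) -> Vec<usize> {
    let mut index = Vec::new();
    let (mut pos, mut i) = (0, 0);
    while pos < buf.len() {
        if i % K == 0 {
            index.push(pos);
        }
        pos = skip(buf, pos);
        i += 1;
    }
    index
}

/// Random access to the n-th integer: jump to the nearest indexed
/// offset, scan forward at most K-1 varints, then decode.
fn get(buf: &[u8], index: &[usize], n: usize) -> u32 {
    let mut pos = index[n / K];
    for _ in 0..(n % K) {
        pos = skip(buf, pos);
    }
    let (mut result, mut shift) = (0u32, 0);
    loop {
        let byte = buf[pos];
        result |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        pos += 1;
        shift += 7;
    }
}

fn main() {
    let values: Vec<u32> = (0..20).map(|i| i * 137).collect();
    let buf = encode(&values);
    let index = build_index(&buf);
    for (i, &v) in values.iter().enumerate() {
        assert_eq!(get(&buf, &index, i), v);
    }
}
```

The index itself is small (one entry per K values), so it is cheap either to compute at load time or to package with the file, as described above.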
If I were building a font file format from scratch, I would seriously consider a layered encoding strategy. The top layer would be defined in terms of a sequence of 32 bit integers. Offsets would count 32 bit values, which I expect will economize entropy. The bottom layer would just be the LEB128 encoding of that sequence, or, perhaps, with packaged acceleration structures as well.
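To illustrate the entropy point with a made-up number (my example, not from the issue): counting an offset in 32 bit words divides it by 4, which often drops it below a 7-bit LEB128 boundary and saves a byte in the bottom-layer encoding.

```rust
/// Number of bytes needed to encode `v` as unsigned LEB128.
fn leb128_len(mut v: u32) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

fn main() {
    let byte_offset = 500u32;          // hypothetical table offset in bytes
    let word_offset = byte_offset / 4; // same offset counted in 32 bit words
    assert_eq!(leb128_len(byte_offset), 2); // 500 needs two varint bytes
    assert_eq!(leb128_len(word_offset), 1); // 125 fits in one
}
```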
I believe this would be efficient in space and time (and likely compress well with Brotli), but perhaps more importantly would be a conceptual simplification as it removes the need to hand-allocate bit sizes.
I will very likely use this technique in piet-gpu (right now it uses fixed-size encoding, which is wasteful of space).
Thanks Raph. In future formats I do believe using a varint encoding is the way to go, and my suggestion of using FlatBuffers or WebAssembly will do that automatically. For the Boring Expansion project, though, I think it's too disruptive to change the font format style so drastically. But I'll keep it in mind as I design various parts.