The 1-byte header makes it impossible to encode when the input size is unknown. Please remove the header and put it at the end of the stream, similar to the padding in base64. Otherwise, keep up the good work!
A simpler scheme would be to pad with 1, 10, 100, 1000, etc. rather than with 0, 00, 000, 0000, etc. Then the padding is self-delimiting and no flag is needed at all, as long as you guarantee the use of at least one bit of padding. (This is the same padding scheme used by most hash algorithms, such as MD5 and SHA.)
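For illustration, the "1 then 0s" padding could be sketched like this (a hypothetical `pad`/`unpad` pair operating on bit strings, not part of any existing implementation):

```python
# Sketch of self-delimiting padding, as in MD5/SHA: append a single 1 bit,
# then 0 bits up to the next block boundary. At least one bit is always
# added, so the decoder can unambiguously strip the padding.

def pad(bits: str, block: int = 7) -> str:
    bits += "1"                          # mandatory marker bit
    bits += "0" * (-len(bits) % block)   # zero-fill to a multiple of `block`
    return bits

def unpad(bits: str) -> str:
    # Drop the trailing zeros, then the single marker 1.
    return bits.rstrip("0")[:-1]

assert unpad(pad("10110")) == "10110"
assert unpad(pad("1111111")) == "1111111"  # full block forces a whole extra padding block
```

Because at least one padding bit is guaranteed, even input that exactly fills the last block round-trips correctly.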
@zmorris Thanks for bringing this up (and for making the issue). This makes perfect sense.
@andersk I need to think about it a little more, especially in the case when the last two-byte character would need to encode 14 bits.
In this scheme, if the last two-byte character would need to encode 14 bits, it is not the last character. You must use at least one bit of padding. So it would be followed by a one-byte character 01000000 encoding seven padding bits 1000000.
Here is another idea after reading this comment. Since there are only six illegal characters and three bits are used to encode them, there are two values left over. One of these values can be used to signify that the last two-byte character is shortened (<= 7 encoded bits), and the remainder of the two-byte character would hold the encoded data.
The nice thing about this is that it removes the header, won't add another byte, and requires only a small change to the implementation (it actually shortens the decoder!)
Hi @kevinAlbs, great work and congrats on your HN appearance! +1 on getting rid of the header byte, that just ruins any kind of streaming use. There are many ways around this without a header. Probably the simplest is to use one of the two encodings you have left in the sss field, say the value 111. This value would just indicate "no character", and the final character is in the remaining bits. A nice property of your encoding is that it pretty much has a fixed expansion ratio. The encoded size can be computed from the length of the unencoded string and its last two characters, and similarly for computing the unencoded size.
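As a rough illustration of the fixed expansion ratio (a hedged sketch that assumes every encoded byte carries 7 payload bits, ignoring the last-character details mentioned above):

```python
import math

# Hypothetical size calculation: n input bytes = 8n bits, at 7 payload bits
# per encoded byte. One-byte characters carry 7 bits and two-byte characters
# carry 14, so the byte-level ratio stays 8:7 either way.
def encoded_size(n: int) -> int:
    return math.ceil(8 * n / 7)

assert encoded_size(7) == 8      # 7 raw bytes -> 8 encoded bytes
assert encoded_size(100) == 115  # ~14.3% expansion
```

The exact size of a real encoded string may differ by a byte at the end, which is why the last characters must be inspected.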
With more trickery it would be possible to encode 139 possibilities per encoded byte for an encoding density of 8 bytes per 9 characters (12.5% expansion) instead of 7 bytes per 8 (14.3% expansion). However, encoding will be so much more complex and slow that it's surely not worth it.
@GeertBosch Thank you! I just pushed this commit which implements just that.
That sounds interesting, how would you achieve 139 possibilities per encoded byte?
You can change the 110sss1x 10xxxxxx format back into 110zzzzz 10zzzzzz. Of the 2048 possibilities for the value of z, the first 128 are not allowed. Define the value of the first base-139 digit as c1 = 122 + (z - 128) / 112. The second base-139 digit is c2 = (z - 128) % 112 - 1. If c2 == -1 there is no second character (end of text). For values of c2 exceeding 110, use the 3-byte encoding 1110zzzz 10zzzzzz 10zzzzzz instead. We need to use the 65536 - 2048 == 63488 possibilities to encode 139 - 122 == 17 excess characters for c1 and 139 - 111 == 28 excess characters for c2, leaving us with 63488 / 17 / 28 == 133 possible values for c3. For the final 7 excess values, we can use the 11110zzz 10zzzzzz 10zzzzzz 10zzzzzz format. The 2031616 usable values in that format are sufficient for the 17 * 28 * 7 * 140 == 466480 needed values.
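The counts in the scheme above can be sanity-checked with a few lines of Python (using 63488 = 65536 - 2048 for the 3-byte budget):

```python
# Sanity checks for the base-139 arithmetic sketched above.
assert 65536 - 2048 == 63488        # 3-byte z values outside the 2-byte range
assert 139 - 122 == 17              # excess c1 values needing more than 2 bytes
assert 139 - 111 == 28              # excess c2 values needing more than 2 bytes
assert 63488 // (17 * 28) == 133    # c3 values available in the 3-byte format
assert 17 * 28 * 7 * 140 == 466480  # values needed in the 4-byte format
assert 2031616 >= 466480            # ...which the 4-byte format can hold
# Capacity: nine base-139 digits hold slightly more than eight raw bytes.
assert 139**9 >= 256**8
```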
Besides the considerable amount of logic for excess values, there is also significant computational overhead in converting 9 base-139 characters into 8 base-256 bytes. This was more of an exercise to explore the limits of how much data one can store in a UTF-8 string when a small set of byte values (6 in this case) is excluded from consideration. Eight bytes of binary data per nine UTF-8 bytes with 6 reserved values is surprisingly close to the theoretical maximum.
@GeertBosch Neat! IIUC this still only occurs when the first digit is > 122, so the additional savings will only occur then?
You'd split your binary input into base-139 digits, so the savings always apply. Even if you encounter only values < 122, you still encode log2(139) ≈ 7.12 bits of information per byte. Of course, splitting binary input into base-139 digits (8 bytes become 9 base-139 digits) is yet more work, so not really practical. Your encoding gets quite close (7 bits per byte) and is far more efficient.
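To make the density comparison concrete (just the arithmetic, no encoder logic):

```python
import math

# Each byte holding one base-139 digit carries log2(139) ≈ 7.12 bits,
# compared to exactly 7 bits per byte in the simpler scheme.
bits = math.log2(139)
assert 7.11 < bits < 7.13

# Nine such bytes carry just over 64 bits, i.e. eight raw bytes.
assert 9 * bits >= 64
```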
Ah yes I see what you mean. I was thinking in terms of bits, not base-139 digits. So even if the value is < 122, you're encoding the entire base-139 digit in one byte. Very cool!
I suppose I should close this issue since we've strayed a bit and it has been resolved.